Reading HDFS File Data with PXF

HDFS is the primary distributed storage mechanism used by Apache Hadoop applications. The PXF HDFS connector reads file data stored in HDFS. The connector supports plain delimited and comma-separated value text files, as well as the Avro binary format.

This section describes how to use PXF to access HDFS data, including how to create and query an external table referencing files in the HDFS data store.

Prerequisites

Before working with HDFS file data using PXF, ensure that:

  • You have installed and configured a Hadoop client on each Greenplum Database segment host. Refer to Installing and Configuring the Hadoop Client for PXF for instructions.
  • You have initialized PXF on your Greenplum Database segment hosts, and PXF is running on each host. See Configuring, Initializing, and Starting PXF for PXF initialization, configuration, and startup information.
  • You have granted the gpadmin user read permission on the relevant portions of your HDFS file system.

If you plan to access Avro format files, make sure that you have downloaded the required JAR file as described in Installing and Configuring the Hadoop Client for PXF. You must restart PXF if you download and install this file while PXF is running.

HDFS File Formats

The PXF HDFS connector supports reading the following file formats:

  • Text - comma-separated value (.csv) or delimited format plain text file
  • Avro - JSON-defined, schema-based data serialization format

The PXF HDFS connector provides the following profiles to read the file formats listed above:

Data Format | Profile Name   | Description
------------+----------------+--------------------------------------------------------------------------------------------
Text        | HdfsTextSimple | Read delimited single-line records from plain text files on HDFS.
Text        | HdfsTextMulti  | Read delimited single or multi-line records with quoted linefeeds from plain text files on HDFS.
Avro        | Avro           | Read Avro format binary files (<filename>.avro).

HDFS Shell Command Primer

Hadoop includes command-line tools that interact directly with your HDFS file system. These tools support typical file system operations including copying and listing files, changing file permissions, and so forth.

The HDFS file system command syntax is hdfs dfs <options> [<file>]. Invoked with no options, hdfs dfs lists the file system options supported by the tool.

The user invoking the hdfs dfs command must have read privileges on the HDFS data store to list and view file contents, and write permission to create directories and files.

The hdfs dfs options used in this topic are:

Option | Description
-------+-------------------------------------------------
-cat   | Display file contents.
-mkdir | Create a directory in HDFS.
-put   | Copy a file from the local file system to HDFS.

Examples:

Create a directory in HDFS:

$ hdfs dfs -mkdir -p /data/exampledir

Copy a text file from your local file system to HDFS:

$ hdfs dfs -put /tmp/example.txt /data/exampledir/

Display the contents of a text file located in HDFS:

$ hdfs dfs -cat /data/exampledir/example.txt

Querying External HDFS Data

The PXF HDFS connector supports the HdfsTextSimple, HdfsTextMulti, and Avro profiles.

Use the following syntax to create a Greenplum Database external table referencing HDFS data: 

CREATE EXTERNAL TABLE <table_name> 
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>
    ?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro[&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);

The specific keywords and values used by the pxf protocol in the CREATE EXTERNAL TABLE command are described in the table below.

Keyword                 | Value
------------------------+----------------------------------------------------------------------------------------------
<path-to-hdfs-file>     | The full file system path to the file in the HDFS data store.
PROFILE                 | The PROFILE keyword must specify one of the values HdfsTextSimple, HdfsTextMulti, or Avro.
<custom-option>         | <custom-option> is profile-specific. Profile-specific options are discussed in the relevant sections later in this topic.
FORMAT 'TEXT'           | Use FORMAT 'TEXT' with the HdfsTextSimple profile when <path-to-hdfs-file> references a plain text delimited file.
FORMAT 'CSV'            | Use FORMAT 'CSV' with the HdfsTextSimple or HdfsTextMulti profile when <path-to-hdfs-file> references a comma-separated value file.
FORMAT 'CUSTOM'         | Use FORMAT 'CUSTOM' with the Avro profile. The Avro 'CUSTOM' FORMAT requires the built-in (formatter='pxfwritable_import') <formatting-property>.
<formatting-properties> | <formatting-properties> are profile-specific. Profile-specific formatting options are identified in the relevant sections later in this topic.

Note: When creating PXF external tables, you cannot use the HEADER option in your FORMAT specification.
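
For example, a minimal illustrative definition over a hypothetical tab-delimited text file (the example.txt path and column names below are placeholders, not used elsewhere in this topic):

-- Hypothetical sketch: example.txt holds two tab-delimited fields per row
CREATE EXTERNAL TABLE example_text (id int, name text)
LOCATION ('pxf://data/pxf_examples/example.txt?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (delimiter=E'\t');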

Reading Text Files

Use the HdfsTextSimple profile when you read a plain text delimited or .csv file where each row is a single record.

<formatting-properties> supported by the HdfsTextSimple profile include:

Keyword   | Syntax, Example(s)                 | Description
----------+------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------
delimiter | (delimiter=E'\t'), (delimiter ':') | The delimiter character in the file. For FORMAT 'CSV', the default value is a comma ,. Preface the value with an E when the value is an escape sequence.

Example: Using the HdfsTextSimple Profile

Perform the following procedure to create a sample text file, copy the file to HDFS, and use the HdfsTextSimple profile to create two PXF external tables to query the data:

  1. Create an HDFS directory for PXF example data files. For example:

    $ hdfs dfs -mkdir -p /data/pxf_examples
    
  2. Create a delimited plain text data file named pxf_hdfs_simple.txt:

    $ echo 'Prague,Jan,101,4875.33
    Rome,Mar,87,1557.39
    Bangalore,May,317,8936.99
    Beijing,Jul,411,11600.67' > /tmp/pxf_hdfs_simple.txt
    

    Note the use of the comma , to separate the four data fields.

  3. Add the data file to HDFS:

    $ hdfs dfs -put /tmp/pxf_hdfs_simple.txt /data/pxf_examples/
    
  4. Display the contents of the pxf_hdfs_simple.txt file stored in HDFS:

    $ hdfs dfs -cat /data/pxf_examples/pxf_hdfs_simple.txt
    
  5. Start the psql subsystem:

    $ psql -d postgres
    
  6. Use the PXF HdfsTextSimple profile to create a Greenplum Database external table referencing the pxf_hdfs_simple.txt file you just created and added to HDFS:

    postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple(location text, month text, num_orders int, total_sales float8)
                LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=HdfsTextSimple')
              FORMAT 'TEXT' (delimiter=E',');
    
  7. Query the external table:

    postgres=# SELECT * FROM pxf_hdfs_textsimple;          
    
       location    | month | num_orders | total_sales 
    ---------------+-------+------------+-------------
     Prague        | Jan   |        101 |     4875.33
     Rome          | Mar   |         87 |     1557.39
     Bangalore     | May   |        317 |     8936.99
     Beijing       | Jul   |        411 |    11600.67
    (4 rows)
    
  8. Create a second external table referencing pxf_hdfs_simple.txt, this time specifying the CSV FORMAT:

    postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple_csv(location text, month text, num_orders int, total_sales float8)
                LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=HdfsTextSimple')
              FORMAT 'CSV';
    postgres=# SELECT * FROM pxf_hdfs_textsimple_csv;          
    

    When specifying FORMAT 'CSV' for a comma-separated value file, no delimiter formatter option is required because comma is the default delimiter value.

Reading Text Files with Quoted Linefeeds

Use the HdfsTextMulti profile to read plain text files with delimited single- or multi-line records that include embedded (quoted) linefeed characters.

<formatting-properties> supported by the HdfsTextMulti profile include:

Keyword   | Syntax, Example(s)                 | Description
----------+------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------
delimiter | (delimiter=E'\t'), (delimiter ':') | The delimiter character in the file. For FORMAT 'CSV', the default value is a comma ,. Preface the value with an E when the value is an escape sequence.

Example: Using the HdfsTextMulti Profile

Perform the following steps to create a sample text file, copy the file to HDFS, and use the PXF HdfsTextMulti profile to create a Greenplum Database external table to query the data:

  1. Create a second delimited plain text file:

    $ vi /tmp/pxf_hdfs_multi.txt
    
  2. Copy/paste the following data into pxf_hdfs_multi.txt:

    "4627 Star Rd.
    San Francisco, CA  94107":Sept:2017
    "113 Moon St.
    San Diego, CA  92093":Jan:2018
    "51 Belt Ct.
    Denver, CO  90123":Dec:2016
    "93114 Radial Rd.
    Chicago, IL  60605":Jul:2017
    "7301 Brookview Ave.
    Columbus, OH  43213":Dec:2018
    

    Notice the use of the colon : to separate the three fields. Also notice the quotes around the first (address) field. This field includes an embedded line feed separating the street address from the city and state.

  3. Copy the text file to HDFS:

    $ hdfs dfs -put /tmp/pxf_hdfs_multi.txt /data/pxf_examples/
    
  4. Use the HdfsTextMulti profile to create an external table referencing the pxf_hdfs_multi.txt HDFS file, making sure to identify the : (colon) as the field separator:

    postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textmulti(address text, month text, year int)
                LOCATION ('pxf://data/pxf_examples/pxf_hdfs_multi.txt?PROFILE=HdfsTextMulti')
              FORMAT 'CSV' (delimiter ':');
    

    Notice the alternate syntax for specifying the delimiter.

  5. Query the pxf_hdfs_textmulti table:

    postgres=# SELECT * FROM pxf_hdfs_textmulti;
    
             address          | month | year 
    --------------------------+-------+------
     4627 Star Rd.            | Sept  | 2017
     San Francisco, CA  94107           
     113 Moon St.             | Jan   | 2018
     San Diego, CA  92093               
     51 Belt Ct.              | Dec   | 2016
     Denver, CO  90123                  
     93114 Radial Rd.         | Jul   | 2017
     Chicago, IL  60605                 
     7301 Brookview Ave.      | Dec   | 2018
     Columbus, OH  43213                
    (5 rows)
    

Reading Avro Files

Apache Avro is a data serialization framework where the data is serialized in a compact binary format. Avro specifies that data types be defined in JSON. Avro format files have an independent schema, also defined in JSON. An Avro schema, together with its data, is fully self-describing.

Data Type Mapping

Avro supports both primitive and complex data types.

To represent Avro primitive data types in Greenplum Database, map data values to Greenplum Database columns of the same type.

Avro supports complex data types including arrays, maps, records, enumerations, and fixed types. Map top-level fields of these complex data types to the Greenplum Database TEXT type. While Greenplum Database does not natively support these types, you can create Greenplum Database functions or application code to extract or further process subcomponents of these complex data types.

The following table summarizes external mapping rules for Avro data.

Avro Data Type                         | PXF/Greenplum Data Type
---------------------------------------+-------------------------------------------------------------------------------------------------------------------------------
boolean                                | boolean
bytes                                  | bytea
double                                 | double precision (float8)
float                                  | real
int                                    | int or smallint
long                                   | bigint
string                                 | text
Complex type: Array, Map, Record, Enum | text, with delimiters inserted between collection items, mapped key-value pairs, and record data.
Complex type: Fixed                    | bytea
Union                                  | Follows the above conventions for primitive or complex data types, depending on the member type of the union; null values are supported.

Avro-Specific Options

For complex types, the PXF Avro profile inserts default delimiters between collection items and values. You can use non-default delimiter characters by identifying values for specific Avro custom options in the CREATE EXTERNAL TABLE command.

The Avro profile supports the following <custom-option>s:

Option Name      | Description
-----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
COLLECTION_DELIM | The delimiter character(s) to place between entries in a top-level array, map, or record field when PXF maps an Avro complex data type to a text column. The default is the comma , character.
MAPKEY_DELIM     | The delimiter character(s) to place between the key and value of a map entry when PXF maps an Avro complex data type to a text column. The default is the colon : character.
RECORDKEY_DELIM  | The delimiter character(s) to place between the field name and value of a record entry when PXF maps an Avro complex data type to a text column. The default is the colon : character.
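
For example, a variant of the Avro external table created later in this topic could render array entries with a pipe character between collection items (an illustrative sketch; only the COLLECTION_DELIM value and table name differ from the example definition below):

-- Illustrative: use a pipe instead of the default comma between array entries
CREATE EXTERNAL TABLE pxf_hdfs_avro_pipe (id bigint, username text, followers text,
    fmap text, relationship text, address text)
LOCATION ('pxf://data/pxf_examples/pxf_hdfs_avro.avro?PROFILE=Avro&COLLECTION_DELIM=|')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');

With this definition, an array value displayed as [kate,santosh] under the default delimiters would instead display as [kate|santosh].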

The <formatting-properties> supported by the Avro profile include:

Keyword   | Syntax, Example                  | Description
----------+----------------------------------+----------------------------------
formatter | (formatter='pxfwritable_import') | Must identify pxfwritable_import.

Avro Schemas and Data

Avro schemas are defined using JSON, and composed of the same primitive and complex types identified in the data type mapping section above. Avro schema files typically have a .avsc suffix.

Fields in an Avro schema file are defined via an array of objects, each of which is specified by a name and a type.

Example: Using the Avro Profile

The examples in this section operate on Avro data with the following record schema (field names and data types):

  • id - long
  • username - string
  • followers - array of string
  • fmap - map of long
  • address - record comprised of street number (int), street name (string), and city (string)
  • relationship - enumerated type

Create Schema

Perform the following operations to create an Avro schema to represent the example schema described above.

  1. Create a file named avro_schema.avsc:

    $ vi /tmp/avro_schema.avsc
    
  2. Copy and paste the following text into avro_schema.avsc:

    {
    "type" : "record",
      "name" : "example_schema",
      "namespace" : "com.example",
      "fields" : [ {
        "name" : "id",
        "type" : "long",
        "doc" : "Id of the user account"
      }, {
        "name" : "username",
        "type" : "string",
        "doc" : "Name of the user account"
      }, {
        "name" : "followers",
        "type" : {"type": "array", "items": "string"},
        "doc" : "Users followers"
      }, {
        "name": "fmap",
        "type": {"type": "map", "values": "long"}
      }, {
        "name": "relationship",
        "type": {
            "type": "enum",
            "name": "relationshipEnum",
            "symbols": ["MARRIED","LOVE","FRIEND","COLLEAGUE","STRANGER","ENEMY"]
        }
      }, {
        "name": "address",
        "type": {
            "type": "record",
            "name": "addressRecord",
            "fields": [
                {"name":"number", "type":"int"},
                {"name":"street", "type":"string"},
                {"name":"city", "type":"string"}]
        }
      } ],
      "doc:" : "A basic schema for storing messages"
    }
    

Create Avro Data File (JSON)

Perform the following steps to create a sample Avro data file conforming to the above schema.

  1. Create a text file named pxf_hdfs_avro.txt:

    $ vi /tmp/pxf_hdfs_avro.txt
    
  2. Enter the following data into pxf_hdfs_avro.txt:

    {"id":1, "username":"john","followers":["kate", "santosh"], "relationship": "FRIEND", "fmap": {"kate":10,"santosh":4}, "address":{"number":1, "street":"renaissance drive", "city":"san jose"}}
    
    {"id":2, "username":"jim","followers":["john", "pam"], "relationship": "COLLEAGUE", "fmap": {"john":3,"pam":3}, "address":{"number":9, "street":"deer creek", "city":"palo alto"}}
    

    The sample data uses a comma , to separate top-level records and a colon : to separate map key/values and record field name/values.

  3. Convert the text file to Avro format. There are various ways to perform the conversion, both programmatically and via the command line. This example uses the Java Avro tools; the avro-tools-1.8.1.jar file resides in the current directory:

    $ java -jar ./avro-tools-1.8.1.jar fromjson --schema-file /tmp/avro_schema.avsc /tmp/pxf_hdfs_avro.txt > /tmp/pxf_hdfs_avro.avro
    

    The generated Avro binary data file is written to /tmp/pxf_hdfs_avro.avro.

  4. Copy the generated Avro file to HDFS:

    $ hdfs dfs -put /tmp/pxf_hdfs_avro.avro /data/pxf_examples/
    

Query With Avro Profile

Perform the following steps to create and query an external table referencing the pxf_hdfs_avro.avro file that you added to HDFS in the previous section. When creating the table:

  • Map the top-level primitive fields, id (type long) and username (type string), to their equivalent Greenplum Database types (bigint and text).
  • Map the remaining complex fields to type text.
  • Explicitly set the record, map, and collection delimiters using the Avro profile custom options.
  1. Use the Avro profile to create a queryable external table from the pxf_hdfs_avro.avro file:

    postgres=# CREATE EXTERNAL TABLE pxf_hdfs_avro(id bigint, username text, followers text, fmap text, relationship text, address text)
                LOCATION ('pxf://data/pxf_examples/pxf_hdfs_avro.avro?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
              FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
    
  2. Perform a simple query of the pxf_hdfs_avro table:

    postgres=# SELECT * FROM pxf_hdfs_avro;
    
     id | username |   followers    |        fmap         | relationship |                      address                      
    ----+----------+----------------+---------------------+--------------+---------------------------------------------------
      1 | john     | [kate,santosh] | {kate:10,santosh:4} | FRIEND       | {number:1,street:renaissance drive,city:san jose}
      2 | jim      | [john,pam]     | {pam:3,john:3}      | COLLEAGUE    | {number:9,street:deer creek,city:palo alto}
    (2 rows)
    

    The simple query of the external table shows the components of the complex type data separated with the delimiters specified in the CREATE EXTERNAL TABLE call.

  3. Process the delimited components in the text columns as necessary for your application. For example, the following command uses the Greenplum Database built-in string_to_array function to convert entries in the followers field to a text array column in a new view. (The substring expression strips the enclosing brackets from the followers text before it is split.)

    postgres=# CREATE VIEW followers_view AS 
    SELECT username, address, string_to_array(substring(followers FROM 2 FOR (char_length(followers) - 2)), ',')::text[] 
        AS followers 
      FROM pxf_hdfs_avro;
    
  4. Query the view to filter rows based on whether a particular follower appears in the view:

    postgres=# SELECT username, address FROM followers_view WHERE followers @> '{john}';
    
     username |                   address                   
    ----------+---------------------------------------------
     jim      | {number:9,street:deer creek,city:palo alto}
    (1 row)
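
    Similar string processing can extract subfields of the record-typed columns. The following sketch is an added illustration, not part of the original example set; it uses the built-in btrim and split_part functions and assumes the delimiters specified in the CREATE EXTERNAL TABLE call above (a comma between record fields, a colon between field name and value):

    -- Illustrative sketch (assumes the delimiters specified above):
    -- strip the braces, take the third comma-separated field (city:<value>),
    -- then keep the text after the colon
    SELECT username,
           split_part(split_part(btrim(address, '{}'), ',', 3), ':', 2) AS city
      FROM pxf_hdfs_avro;

    For the sample rows above, this query returns san jose for john and palo alto for jim.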