Reading a Multi-Line Text File into a Single Table Row

A newer version of this documentation is available. Use the version menu above to view the most up-to-date release of the Greenplum 5.x documentation.

You can use the PXF HDFS connector to read one or more multi-line text files in HDFS each as a single table row. This may be useful when you want to read multiple files into the same Greenplum Database external table, for example when individual JSON files each contain a separate record.

PXF supports reading only text and JSON files in this manner.

Note: Refer to the Reading JSON Data from HDFS topic if you want to use PXF to read JSON files that include more than one record.

Prerequisites

Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read files from HDFS.

Reading Multi-Line Text and JSON Files

You can read single- and multi-line files into a single table row, including files with embedded linefeeds. If you are reading multiple JSON files, each file must be a complete record, and each file must contain the same record type.

PXF reads the complete file data into a single row and column. When you create the external table to read multiple files, you must ensure that all of the files that you want to read are of the same (text or JSON) type. You must also specify a single text or json column, depending upon the file type.

The following syntax creates a Greenplum Database readable external table that references one or more text or JSON files on HDFS:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> text|json | LIKE <other_table> )
  LOCATION ('pxf://<path-to-files>?PROFILE=hdfs:text:multi[&SERVER=<server_name>]&FILE_AS_ROW=true')
FORMAT 'CSV');

The keywords and values used in this CREATE EXTERNAL TABLE command are described in the table below.

Keyword Value
<path‑to‑files> The absolute path to the directory or files in the HDFS data store.
PROFILE The PROFILE keyword must specify hdfs:text:multi.
SERVER=<server_name> The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified.
FILE_AS_ROW=true The required option that instructs PXF to read each file into a single table row.
FORMAT The FORMAT must specify 'CSV'.

Note: The hdfs:text:multi profile does not support additional format options when you specify the FILE_AS_ROW=true option.

For example, if /data/pxf_examples/jdir identifies an HDFS directory that contains a number of JSON files, the following statement creates a Greenplum Database external table that references all of the files in that directory:

CREATE EXTERNAL TABLE pxf_readjfiles(j1 json)
  LOCATION ('pxf://data/pxf_examples/jdir?PROFILE=hdfs:text:multi&FILE_AS_ROW=true')
FORMAT 'CSV';

When you query the pxf_readjfiles table with a SELECT statement, PXF returns the contents of each JSON file in jdir/ as a separate row in the external table.

When you read JSON files, you can use the JSON functions provided in Greenplum Database to access individual data fields in the JSON record. For example, if the pxf_readjfiles external table above reads a JSON file that contains this JSON record:

{
  "root":[
    {
      "record_obj":{
        "created_at":"MonSep3004:04:53+00002013",
        "id_str":"384529256681725952",
        "user":{
          "id":31424214,
          "location":"COLUMBUS"
        },
        "coordinates":null
      }
    }
  ]
}

You can use the json_array_elements() function to extract specific JSON fields from the table row. For example, the following command displays the user->id field:

SELECT json_array_elements(j1->'root')->'record_obj'->'user'->'id'
  AS userid FROM pxf_readjfiles;

  userid  
----------
 31424214
(1 rows)

Refer to Working with JSON Data for specific information on manipulating JSON data with Greenplum Database.

Example: Reading an HDFS Text File into a Single Table Row

Perform the following procedure to create 3 sample text files in an HDFS directory, and use the PXF hdfs:text:multi profile and the default PXF server to read all of these text files in a single external table query.

  1. Create an HDFS directory for the text files. For example:

    $ hdfs dfs -mkdir -p /data/pxf_examples/tdir
    
  2. Create a text data file named file1.txt:

    $ echo 'text file with only one line' > /tmp/file1.txt
    
  3. Create a second text data file named file2.txt:

    $ echo 'Prague,Jan,101,4875.33
    Rome,Mar,87,1557.39
    Bangalore,May,317,8936.99
    Beijing,Jul,411,11600.67' > /tmp/file2.txt
    

    This file has multiple lines.

  4. Create a third text file named /tmp/file3.txt:

    $ echo '"4627 Star Rd.
    San Francisco, CA  94107":Sept:2017
    "113 Moon St.
    San Diego, CA  92093":Jan:2018
    "51 Belt Ct.
    Denver, CO  90123":Dec:2016
    "93114 Radial Rd.
    Chicago, IL  60605":Jul:2017
    "7301 Brookview Ave.
    Columbus, OH  43213":Dec:2018' > /tmp/file3.txt
    

    This file includes embedded line feeds.

  5. Save the file and exit the editor.

  6. Copy the text files to HDFS:

    $ hdfs dfs -put /tmp/file1.txt /data/pxf_examples/tdir
    $ hdfs dfs -put /tmp/file2.txt /data/pxf_examples/tdir
    $ hdfs dfs -put /tmp/file3.txt /data/pxf_examples/tdir
    
  7. Log in to a Greenplum Database system and start the psql subsystem.

  8. Use the hdfs:text:multi profile to create an external table that references the tdir HDFS directory. For example:

    CREATE EXTERNAL TABLE pxf_readfileasrow(c1 text)
      LOCATION ('pxf://data/pxf_examples/tdir?PROFILE=hdfs:text:multi&FILE_AS_ROW=true')
    FORMAT 'CSV';
    
  9. Turn on expanded display and query the pxf_readfileasrow table:

    postgres=# \x on
    postgres=# SELECT * FROM pxf_readfileasrow;
    
    -[ RECORD 1 ]---------------------------
    c1 | Prague,Jan,101,4875.33
       | Rome,Mar,87,1557.39
       | Bangalore,May,317,8936.99
       | Beijing,Jul,411,11600.67
    -[ RECORD 2 ]---------------------------
    c1 | text file with only one line
    -[ RECORD 3 ]---------------------------
    c1 | "4627 Star Rd.
       | San Francisco, CA  94107":Sept:2017
       | "113 Moon St.
       | San Diego, CA  92093":Jan:2018
       | "51 Belt Ct.
       | Denver, CO  90123":Dec:2016
       | "93114 Radial Rd.
       | Chicago, IL  60605":Jul:2017
       | "7301 Brookview Ave.
       | Columbus, OH  43213":Dec:2018