Reading and Writing Text Data in an Object Store


The PXF object store connectors support plain delimited and comma-separated value format text data. This section describes how to use PXF to access text data in an object store, including how to create, query, and insert data into an external table that references files in the object store.

Note: Accessing text data from an object store is very similar to accessing text data in HDFS.

Prerequisites

Ensure that you have met the PXF Object Store Prerequisites before you attempt to read data from or write data to an object store.

Reading Text Data

Use the <objstore>:text profile when you read plain text delimited or .csv data from an object store where each row is a single record. PXF supports the following <objstore> profile prefixes:

Object Store          Profile Prefix
Azure Blob Storage    wasbs
Azure Data Lake       adl
Google Cloud Storage  gs
Minio                 s3
S3                    s3

The following syntax creates a Greenplum Database readable external table that references a simple text file in an object store: 

CREATE EXTERNAL TABLE <table_name> 
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-file>?PROFILE=<objstore>:text[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>');

The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.

Keyword                   Value
<path-to-file>            The absolute path to the directory or file in the object store.
PROFILE=<objstore>:text   The PROFILE keyword must identify the specific object store. For example, s3:text.
SERVER=<server_name>      The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified.
FORMAT                    Use FORMAT 'TEXT' when <path-to-file> references plain text delimited data. Use FORMAT 'CSV' when <path-to-file> references comma-separated value data.
delimiter                 The delimiter character in the data. For FORMAT 'CSV', the default <delim_value> is a comma (,). Preface the <delim_value> with an E when the value is an escape sequence. Examples: (delimiter=E'\t'), (delimiter ':').

If you are accessing an S3 object store, you can provide S3 credentials directly in the CREATE EXTERNAL TABLE command as described in Overriding the S3 Server Configuration.
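As a hedged sketch, such a table definition might look like the following, assuming the accesskey and secretkey custom options described in Overriding the S3 Server Configuration; the bucket, path, and credential values here are placeholders:

```sql
-- Sketch only: supply S3 credentials directly in the LOCATION clause instead
-- of relying on a named server configuration. The accesskey and secretkey
-- option names are those documented in Overriding the S3 Server
-- Configuration; YOURKEY and YOURSECRET are placeholders.
CREATE EXTERNAL TABLE pxf_s3_creds_inline (location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://BUCKET/pxf_examples/pxf_s3_simple.txt?PROFILE=s3:text&accesskey=YOURKEY&secretkey=YOURSECRET')
FORMAT 'TEXT' (delimiter=',');
```

Keep in mind that credentials embedded in a table definition are visible to anyone who can describe the table; a named server configuration is usually the safer choice.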

Example: Reading Text Data from S3

Perform the following procedure to create a sample text file, copy the file to S3, and use the s3:text profile to create two PXF external tables to query the data.

To run this example, you must:

  • Have the AWS CLI tools installed on your system
  • Know your AWS access ID and secret key
  • Have write permission to an S3 bucket
  1. Create a directory in S3 for PXF example data files. For example, if you have write access to an S3 bucket named BUCKET:

    $ aws s3 mb s3://BUCKET/pxf_examples
    
  2. Locally create a delimited plain text data file named pxf_s3_simple.txt:

    $ echo 'Prague,Jan,101,4875.33
    Rome,Mar,87,1557.39
    Bangalore,May,317,8936.99
    Beijing,Jul,411,11600.67' > /tmp/pxf_s3_simple.txt
    

Note the use of the comma (,) to separate the four data fields.

  3. Copy the data file to the S3 directory you created in Step 1:

    $ aws s3 cp /tmp/pxf_s3_simple.txt s3://BUCKET/pxf_examples/
    
  4. Verify that the file now resides in S3:

    $ aws s3 ls s3://BUCKET/pxf_examples/pxf_s3_simple.txt
    
  5. Start the psql subsystem:

    $ psql -d postgres
    
  6. Use the PXF s3:text profile to create a Greenplum Database external table that references the pxf_s3_simple.txt file that you just created and added to S3. For example, if your server name is s3srvcfg:

    postgres=# CREATE EXTERNAL TABLE pxf_s3_textsimple(location text, month text, num_orders int, total_sales float8)
                LOCATION ('pxf://BUCKET/pxf_examples/pxf_s3_simple.txt?PROFILE=s3:text&SERVER=s3srvcfg')
              FORMAT 'TEXT' (delimiter=E',');
    
  7. Query the external table:

    postgres=# SELECT * FROM pxf_s3_textsimple;          
    
       location    | month | num_orders | total_sales 
    ---------------+-------+------------+-------------
     Prague        | Jan   |        101 |     4875.33
     Rome          | Mar   |         87 |     1557.39
     Bangalore     | May   |        317 |     8936.99
     Beijing       | Jul   |        411 |    11600.67
    (4 rows)
    
  8. Create a second external table that references pxf_s3_simple.txt, this time specifying the CSV FORMAT:

    postgres=# CREATE EXTERNAL TABLE pxf_s3_textsimple_csv(location text, month text, num_orders int, total_sales float8)
                LOCATION ('pxf://BUCKET/pxf_examples/pxf_s3_simple.txt?PROFILE=s3:text&SERVER=s3srvcfg')
              FORMAT 'CSV';
    postgres=# SELECT * FROM pxf_s3_textsimple_csv;          
    

    When you specify FORMAT 'CSV' for comma-separated value data, no delimiter formatter option is required because comma is the default delimiter value.
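Before pointing an external table at a data file, it can be worth sanity-checking the file's structure locally. A small sketch using awk confirms that every line of the sample file from step 2 carries exactly four comma-separated fields, matching the four-column table definitions above:

```shell
# Recreate the sample data file from step 2.
echo 'Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67' > /tmp/pxf_s3_simple.txt

# Print the field count per line using a comma delimiter; every line
# should report 4 fields.
awk -F',' '{ print NF }' /tmp/pxf_s3_simple.txt
```

A mismatch between the per-line field count and the external table's column count is a common cause of load errors, so this check is cheap insurance.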

Reading Text Data with Quoted Linefeeds

Use the <objstore>:text:multi profile to read plain text data with delimited single-line or multi-line records that include embedded (quoted) linefeed characters. The following syntax creates a Greenplum Database readable external table that references such a text file in an object store:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-file>?PROFILE=<objstore>:text:multi[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>');

The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.

Keyword                        Value
<path-to-file>                 The absolute path to the directory or file in the object store.
PROFILE=<objstore>:text:multi  The PROFILE keyword must identify the specific object store. For example, s3:text:multi.
SERVER=<server_name>           The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified.
FORMAT                         Use FORMAT 'TEXT' when <path-to-file> references plain text delimited data. Use FORMAT 'CSV' when <path-to-file> references comma-separated value data.
delimiter                      The delimiter character in the data. For FORMAT 'CSV', the default <delim_value> is a comma (,). Preface the <delim_value> with an E when the value is an escape sequence. Examples: (delimiter=E'\t'), (delimiter ':').

If you are accessing an S3 object store, you can provide S3 credentials directly in the CREATE EXTERNAL TABLE command as described in Overriding the S3 Server Configuration.

Example: Reading Multi-Line Text Data from S3

Perform the following steps to create a sample text file, copy the file to S3, and use the PXF s3:text:multi profile to create a Greenplum Database readable external table to query the data.

To run this example, you must:

  • Have the AWS CLI tools installed on your system
  • Know your AWS access ID and secret key
  • Have write permission to an S3 bucket
  1. Create a second delimited plain text file:

    $ vi /tmp/pxf_s3_multi.txt
    
  2. Copy/paste the following data into pxf_s3_multi.txt:

    "4627 Star Rd.
    San Francisco, CA  94107":Sept:2017
    "113 Moon St.
    San Diego, CA  92093":Jan:2018
    "51 Belt Ct.
    Denver, CO  90123":Dec:2016
    "93114 Radial Rd.
    Chicago, IL  60605":Jul:2017
    "7301 Brookview Ave.
    Columbus, OH  43213":Dec:2018
    

Notice the use of the colon (:) to separate the three fields. Also notice the quotes around the first (address) field. This field includes an embedded line feed separating the street address from the city and state.

  3. Copy the text file to S3:

    $ aws s3 cp /tmp/pxf_s3_multi.txt s3://BUCKET/pxf_examples/
    
  4. Use the s3:text:multi profile to create an external table that references the pxf_s3_multi.txt S3 file, making sure to identify the : (colon) as the field separator. For example, if your server name is s3srvcfg:

    postgres=# CREATE EXTERNAL TABLE pxf_s3_textmulti(address text, month text, year int)
                LOCATION ('pxf://BUCKET/pxf_examples/pxf_s3_multi.txt?PROFILE=s3:text:multi&SERVER=s3srvcfg')
              FORMAT 'CSV' (delimiter ':');
    

    Notice the alternate syntax for specifying the delimiter.

  5. Query the pxf_s3_textmulti table:

    postgres=# SELECT * FROM pxf_s3_textmulti;
    
             address          | month | year 
    --------------------------+-------+------
     4627 Star Rd.            | Sept  | 2017
     San Francisco, CA  94107           
     113 Moon St.             | Jan   | 2018
     San Diego, CA  92093               
     51 Belt Ct.              | Dec   | 2016
     Denver, CO  90123                  
     93114 Radial Rd.         | Jul   | 2017
     Chicago, IL  60605                 
     7301 Brookview Ave.      | Dec   | 2018
     Columbus, OH  43213                
    (5 rows)
    
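The multi-line behavior is easy to reason about locally: each sample record spans two physical lines, so a plain line count over-counts records. The following sketch (local file only, no S3 required) distinguishes physical lines from logical records by tracking double-quote parity, which is essentially what makes the text:multi profile necessary for this data:

```shell
# Recreate the first two records of the sample file; the quoted address
# field embeds a linefeed, so each record spans two physical lines.
cat > /tmp/pxf_s3_multi_check.txt <<'EOF'
"4627 Star Rd.
San Francisco, CA  94107":Sept:2017
"113 Moon St.
San Diego, CA  92093":Jan:2018
EOF

# Physical line count: 4 lines for only 2 records.
wc -l < /tmp/pxf_s3_multi_check.txt

# Logical record count: a record ends only on a line where the running
# number of double quotes seen so far is even (no quote left open).
awk '{ n += gsub(/"/, "&") } n % 2 == 0 { recs++ } END { print recs }' \
    /tmp/pxf_s3_multi_check.txt
```

A plain <objstore>:text profile would treat each physical line as a record; the text:multi profile reassembles the quoted, linefeed-containing field into a single record as the query output above shows.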

Writing Text Data

The <objstore>:text profiles support writing single-line plain text data to an object store. When you create a writable external table with PXF, you specify the name of a directory. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified.

Note: External tables that you create with a writable profile can only be used for INSERT operations. If you want to query the data that you inserted, you must create a separate readable external table that references the directory.

Use the following syntax to create a Greenplum Database writable external table that references an object store directory: 

CREATE WRITABLE EXTERNAL TABLE <table_name> 
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-dir>
    ?PROFILE=<objstore>:text[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];

The specific keywords and values used in the CREATE WRITABLE EXTERNAL TABLE command are described in the table below.

Keyword                   Value
<path-to-dir>             The absolute path to the directory in the object store.
PROFILE=<objstore>:text   The PROFILE keyword must identify the specific object store. For example, s3:text.
SERVER=<server_name>      The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified.
<custom-option>=<value>   <custom-option>s are described below.
FORMAT                    Use FORMAT 'TEXT' to write plain, delimited text to <path-to-dir>. Use FORMAT 'CSV' to write comma-separated value text to <path-to-dir>.
delimiter                 The delimiter character in the data. For FORMAT 'CSV', the default <delim_value> is a comma (,). Preface the <delim_value> with an E when the value is an escape sequence. Examples: (delimiter=E'\t'), (delimiter ':').
DISTRIBUTED BY            If you plan to load the writable external table with data from an existing Greenplum Database table, consider specifying the same distribution policy or <column_name> on the writable external table as that defined for the source table. Doing so avoids extra motion of data between segments during the load operation.
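The intent of the DISTRIBUTED BY guidance can be sketched as follows; the sales_local and sales_ext table names are illustrative, not part of the examples in this section:

```sql
-- Illustrative sketch: the source table is distributed by location, so the
-- writable external table declares the same distribution policy. The INSERT
-- then requires no redistribution of rows between segments.
CREATE TABLE sales_local (location text, month text, num_orders int, total_sales float8)
DISTRIBUTED BY (location);

CREATE WRITABLE EXTERNAL TABLE sales_ext (LIKE sales_local)
LOCATION ('pxf://BUCKET/pxf_examples/sales?PROFILE=s3:text&SERVER=s3srvcfg')
FORMAT 'TEXT' (delimiter=',')
DISTRIBUTED BY (location);

INSERT INTO sales_ext SELECT * FROM sales_local;
```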

Writable external tables that you create using an <objstore>:text profile can optionally use record or block compression. The PXF <objstore>:text profiles support the following compression codecs:

  • org.apache.hadoop.io.compress.DefaultCodec
  • org.apache.hadoop.io.compress.GzipCodec
  • org.apache.hadoop.io.compress.BZip2Codec

You specify the compression codec via custom options in the CREATE EXTERNAL TABLE LOCATION clause. The <objstore>:text profiles support the following custom write options:

Option             Value Description
COMPRESSION_CODEC  The compression codec Java class name. If this option is not provided, Greenplum Database performs no data compression. Supported compression codecs include:
                   org.apache.hadoop.io.compress.DefaultCodec
                   org.apache.hadoop.io.compress.BZip2Codec
                   org.apache.hadoop.io.compress.GzipCodec
COMPRESSION_TYPE   The compression type to employ; supported values are RECORD (the default) or BLOCK.
THREAD-SAFE        Boolean value determining if a table query can run in multi-threaded mode. The default value is TRUE. Set this option to FALSE to handle all requests in a single thread for operations that are not thread-safe (for example, compression).
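Putting these options together, a hedged sketch of a writable table that block-compresses its output with BZip2 and disables multi-threaded request handling might look like this (the table name and S3 path are illustrative):

```sql
-- Sketch combining the custom write options from the table above:
-- BZip2 codec, BLOCK compression type, and single-threaded handling.
CREATE WRITABLE EXTERNAL TABLE pxf_s3_write_bzip2 (location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://BUCKET/pxf_examples/compressed_data?PROFILE=s3:text&SERVER=s3srvcfg&COMPRESSION_CODEC=org.apache.hadoop.io.compress.BZip2Codec&COMPRESSION_TYPE=BLOCK&THREAD-SAFE=false')
FORMAT 'TEXT' (delimiter=',');
```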

If you are accessing an S3 object store, you can provide S3 credentials directly in the CREATE EXTERNAL TABLE command as described in Overriding the S3 Server Configuration.

Example: Writing Text Data to S3

This example utilizes the data schema introduced in Example: Reading Text Data from S3.

Column Name Data Type
location text
month text
number_of_orders int
total_sales float8

This example also optionally uses the Greenplum Database external table named pxf_s3_textsimple that you created in that exercise.

Procedure

Perform the following procedure to create Greenplum Database writable external tables utilizing the same data schema as described above, one of which will employ compression. You will use the PXF s3:text profile to write data to S3. You will also create a separate, readable external table to read the data that you wrote to S3.

  1. Create a Greenplum Database writable external table utilizing the data schema described above. Write to the S3 directory BUCKET/pxf_examples/pxfwrite_s3_textsimple1. Create the table specifying a comma , as the delimiter. For example, if your server name is s3srvcfg:

    postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_s3_writetbl_1(location text, month text, num_orders int, total_sales float8)
                LOCATION ('pxf://BUCKET/pxf_examples/pxfwrite_s3_textsimple1?PROFILE=s3:text&SERVER=s3srvcfg')
              FORMAT 'TEXT' (delimiter=',');
    

You specify the FORMAT subclause delimiter value as the single ASCII comma character (,).

  2. Write a few individual records to the pxfwrite_s3_textsimple1 S3 directory by invoking the SQL INSERT command on pxf_s3_writetbl_1:

    postgres=# INSERT INTO pxf_s3_writetbl_1 VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
    postgres=# INSERT INTO pxf_s3_writetbl_1 VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
    
  3. (Optional) Insert the data from the pxf_s3_textsimple table that you created in Example: Reading Text Data from S3 into pxf_s3_writetbl_1:

    postgres=# INSERT INTO pxf_s3_writetbl_1 SELECT * FROM pxf_s3_textsimple;
    
  4. Greenplum Database does not support directly querying a writable external table. To query the data that you just added to S3, you must create a readable external Greenplum Database table that references the S3 directory:

    postgres=# CREATE EXTERNAL TABLE pxf_s3_textsimple_r1(location text, month text, num_orders int, total_sales float8)
                LOCATION ('pxf://BUCKET/pxf_examples/pxfwrite_s3_textsimple1?PROFILE=s3:text&SERVER=s3srvcfg')
                FORMAT 'CSV';
    

    You specify the 'CSV' FORMAT when you create the readable external table because you created the writable table with a comma , as the delimiter character, the default delimiter for 'CSV' FORMAT.

  5. Query the readable external table:

    postgres=# SELECT * FROM pxf_s3_textsimple_r1 ORDER BY total_sales;
    
     location  | month | num_orders | total_sales 
    -----------+-------+------------+-------------
     Rome      | Mar   |         87 |     1557.39
     Frankfurt | Mar   |        777 |     3956.98
     Prague    | Jan   |        101 |     4875.33
     Bangalore | May   |        317 |     8936.99
     Beijing   | Jul   |        411 |    11600.67
     Cleveland | Oct   |       3812 |    96645.37
    (6 rows)
    

    The pxf_s3_textsimple_r1 table includes the records you individually inserted, as well as the full contents of the pxf_s3_textsimple table if you performed the optional step.

  6. Create a second Greenplum Database writable external table, this time using Gzip compression and employing a colon : as the delimiter:

    postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_s3_writetbl_2 (location text, month text, num_orders int, total_sales float8)
                LOCATION ('pxf://BUCKET/pxf_examples/pxfwrite_s3_textsimple2?PROFILE=s3:text&SERVER=s3srvcfg&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
              FORMAT 'TEXT' (delimiter=':');
    
  7. Write a few records to the pxfwrite_s3_textsimple2 S3 directory by inserting directly into the pxf_s3_writetbl_2 table:

    postgres=# INSERT INTO pxf_s3_writetbl_2 VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
    postgres=# INSERT INTO pxf_s3_writetbl_2 VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
    
  8. To query data from the newly created S3 directory named pxfwrite_s3_textsimple2, you can create a readable external Greenplum Database table as described above that references this S3 directory and specifies FORMAT 'CSV' (delimiter=':').
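For example, a readable counterpart for pxfwrite_s3_textsimple2 might look like this; the table name pxf_s3_textsimple_r2 is illustrative:

```sql
postgres=# CREATE EXTERNAL TABLE pxf_s3_textsimple_r2(location text, month text, num_orders int, total_sales float8)
            LOCATION ('pxf://BUCKET/pxf_examples/pxfwrite_s3_textsimple2?PROFILE=s3:text&SERVER=s3srvcfg')
          FORMAT 'CSV' (delimiter=':');
```

Note that PXF reads the Gzip-compressed files in the directory without any additional options; the colon delimiter is the only formatting detail the readable table must match.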