Reading CSV and Parquet Data from S3 Using S3 Select

The PXF S3 connector supports reading certain CSV- and Parquet-format data from S3 using the Amazon S3 Select service. S3 Select provides direct query-in-place features on data stored in Amazon S3.

When you enable it, PXF uses S3 Select to filter the contents of S3 objects to retrieve the subset of data that you request. This typically reduces both the amount of data transferred to Greenplum Database and the query time.

You can use the PXF S3 Connector with S3 Select to read:

  • gzip- or bzip2-compressed CSV files
  • Parquet files with gzip- or snappy-compressed columns

The data must be UTF-8-encoded, and may be server-side encrypted.

PXF supports column projection as well as predicate pushdown for AND, OR, and NOT operators when using S3 Select.

Using the Amazon S3 Select service may increase the cost of data access and retrieval. Be sure to consider the associated costs before you enable PXF to use the S3 Select service.

Enabling PXF to Use S3 Select

The S3_SELECT external table custom option governs PXF’s use of S3 Select when accessing the S3 object store. You can provide the following values when you set the S3_SELECT option:

S3-SELECT Value Description
OFF PXF does not use S3 Select; the default.
ON PXF always uses S3 Select.
AUTO PXF uses S3 Select when it will benefit access or performance.

By default, PXF does not use S3 Select (S3_SELECT=OFF). You can enable PXF to always use S3 Select, or to use S3 Select only when PXF determines that it could be beneficial for performance. For example, when S3_SELECT=AUTO, PXF automatically uses S3 Select when a query on the external table utilizes column projection or predicate pushdown, or when the referenced CSV file has a header row.

Reading Parquet Data with S3 Select

PXF supports reading Parquet data from S3 as described in Reading and Writing Parquet Data in an Object Store. If you want PXF to use S3 Select when reading the Parquet data, you add the S3_SELECT custom option and value to the CREATE EXTERNAL TABLE LOCATION URI.

Specifying the Parquet Column Compression Type

If columns in the Parquet file are gzip- or snappy-compressed, use the COMPRESSION_CODEC custom option in the LOCATION URI to identify the compression codec alias. For example:

&COMPRESSION_CODEC=gzip

Or,

&COMPRESSION_CODEC=snappy

Creating the External Table

Use the following syntax to create a Greenplum Database external table that references a Parquet file on S3 that you want PXF to access with the S3 Select service:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
  LOCATION ('pxf://<path-to-file>?PROFILE=s3:parquet&SERVER=<server_name>&S3_SELECT=ON|AUTO[&<other-custom-option>=<value>[...]]')
FORMAT 'CSV';
You must specify FORMAT 'CSV' when you enable PXF to use S3 Select on an external table that accesses a Parquet file on S3.

For example, use the following command to have PXF use S3 Select to access a Parquet file on S3 when optimal:

CREATE EXTERNAL TABLE parquet_on_s3 ( LIKE table1 )
  LOCATION ('pxf://bucket/file.parquet?PROFILE=s3:parquet&SERVER=s3srvcfg&S3_SELECT=AUTO')
FORMAT 'CSV';

Reading CSV files with S3 Select

PXF supports reading CSV data from S3 as described in Reading and Writing Text Data in an Object Store. If you want PXF to use S3 Select when reading the CSV data, you add the S3_SELECT custom option and value to the CREATE EXTERNAL TABLE LOCATION URI. You may also specify the delimiter formatter option and the file header and compression custom options.

Handling the CSV File Header

PXF can read a CSV file with a header row only when the S3 Connector uses the Amazon S3 Select service to access the file on S3. PXF does not support reading a CSV file that includes a header row from any other external data store.

CSV files may include a header line. When you enable PXF to use S3 Select to access a CSV-format file, you use the FILE_HEADER custom option in the LOCATION URI to identify whether or not the CSV file has a header row and, if so, how you want PXF to handle the header. PXF never returns the header row.

Note: You must specify S3_SELECT=ON or S3_SELECT=AUTO when the CSV file has a header row. Do not specify S3_SELECT=OFF in this case.

The FILE_HEADER option takes the following values:

FILE_HEADER Value Description
NONE The file has no header row; the default.
IGNORE The file has a header row; ignore the header. Use when the order of the columns in the external table and the CSV file are the same. (When the column order is the same, the column names and the CSV header names may be different.)
USE The file has a header row; read the header. Use when the external table column names and the CSV header names are the same, but are in a different order.

If both the order and the names of the external table columns and the CSV header are the same, you can specify either FILE_HEADER=IGNORE or FILE_HEADER=USE.

PXF cannot match the CSV data with the external table definition when both the order and the names of the external table columns are different from the CSV header columns. Any query on an external table with these conditions fails with the error Some headers in the query are missing from the file.

For example, if the order of the columns in the CSV file header and the external table are the same, add the following to the CREATE EXTERNAL TABLE LOCATION URI to have PXF ignore the CSV header:

&FILE_HEADER=IGNORE

Specifying the CSV File Compression Type

If the CSV file is gzip- or bzip2-compressed, use the COMPRESSION_CODEC custom option in the LOCATION URI to identify the compression codec alias. For example:

&COMPRESSION_CODEC=gzip

Or,

&COMPRESSION_CODEC=bzip2

Creating the External Table

Use the following syntax to create a Greenplum Database external table that references a CSV file on S3 that you want PXF to access with the S3 Select service:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-file>
    ?PROFILE=s3:text&SERVER=<server_name>&S3_SELECT=ON|AUTO[&FILE_HEADER=IGNORE|USE][&COMPRESSION_CODEC=gzip|bzip2][&<other-custom-option>=<value>[...]]')
FORMAT 'CSV' [(delimiter '<delim_char>')];

Note: Do not use the (HEADER) formatter option in the CREATE EXTERNAL TABLE command.

For example, use the following command to have PXF always use S3 Select to access a gzip-compressed file on S3, where the field delimiter is a pipe (’|’) character and the external table and CSV header columns are in the same order.

CREATE EXTERNAL TABLE gzippedcsv_on_s3 ( LIKE table2 )
  LOCATION ('pxf://bucket/file.csv.gz?PROFILE=s3:text&SERVER=s3srvcfg&S3_SELECT=ON&FILE_HEADER=USE')
FORMAT 'CSV' (delimiter '|');