Reading CSV and Parquet Data from S3 Using S3 Select

The PXF S3 connector supports reading certain CSV- and Parquet-format data from S3 using the Amazon S3 Select service. S3 Select provides direct query-in-place features on data stored in Amazon S3.

When you enable it, PXF uses S3 Select to filter the contents of S3 objects to retrieve the subset of data that you request. This typically reduces both the amount of data transferred to Greenplum Database and the query time.

You can use the PXF S3 Connector with S3 Select to read:

  • gzip- or bzip2-compressed CSV files
  • Parquet files with gzip- or snappy-compressed columns

The data must be UTF-8-encoded, and may be server-side encrypted.

PXF supports column projection as well as predicate pushdown for AND, OR, and NOT operators when using S3 Select.

Using the Amazon S3 Select service may increase the cost of data access and retrieval. Be sure to consider the associated costs before you enable PXF to use the S3 Select service.

Enabling PXF to Use S3 Select

The S3_SELECT external table custom option governs PXF’s use of S3 Select when accessing the S3 object store. You can provide the following values when you set the S3_SELECT option:

S3-SELECT Value Description
OFF PXF does not use S3 Select; the default.
ON PXF always uses S3 Select.
AUTO PXF uses S3 Select when it will benefit access or performance.

By default, PXF does not use S3 Select (S3_SELECT=OFF). You can enable PXF to always use S3 Select, or to use S3 Select only when PXF determines that it could be beneficial for performance. For example, when S3_SELECT=AUTO, PXF automatically uses S3 Select when a query on the external table utilizes column projection or predicate pushdown, or when the referenced CSV file has a header row.

Reading Parquet Data with S3 Select

PXF supports reading Parquet data from S3 as described in Reading and Writing Parquet Data in an Object Store. If you want PXF to use S3 Select when reading the Parquet data, you add the S3_SELECT custom option and value to the CREATE EXTERNAL TABLE LOCATION URI.

Specifying the Parquet Column Compression Type

If columns in the Parquet file are gzip- or snappy-compressed, use the COMPRESSION_CODEC custom option in the LOCATION URI to identify the compression codec alias. For example:

&COMPRESSION_CODEC=gzip

Or,

&COMPRESSION_CODEC=snappy

Creating the External Table

Use the following syntax to create a Greenplum Database external table that references a Parquet file on S3 that you want PXF to access with the S3 Select service:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
  LOCATION ('pxf://<path-to-file>?PROFILE=s3:parquet&SERVER=<server_name>&S3_SELECT=ON|AUTO[&<other-custom-option>=<value>[...]]')
FORMAT 'CSV';
You must specify FORMAT 'CSV' when you enable PXF to use S3 Select on an external table that accesses a Parquet file on S3.

For example, use the following command to have PXF use S3 Select to access a Parquet file on S3 when optimal:

CREATE EXTERNAL TABLE parquet_on_s3 ( LIKE table1 )
  LOCATION ('pxf://bucket/file.parquet?PROFILE=s3:parquet&SERVER=s3srvcfg&S3_SELECT=AUTO')
FORMAT 'CSV';

Reading CSV files with S3 Select

PXF supports reading CSV data from S3 as described in Reading and Writing Text Data in an Object Store. If you want PXF to use S3 Select when reading the CSV data, you add the S3_SELECT custom option and value to the CREATE EXTERNAL TABLE LOCATION URI. You may also specify the delimiter and file header and compression custom options.

Handling the CSV File Header

When you enable PXF to use S3 Select to access a CSV-format file, you use the FILE_HEADER custom option in the LOCATION URI to identify whether or not the CSV file has a header row, and, if so, how you want PXF to handle the header. The FILE_HEADER option takes the following values:

FILE_HEADER Value Description
NONE The file has no header row; the default.
IGNORE The file has a header row; ignore the header.
USE The file has a header row; read the header.

The default FILE_HEADER value is NONE. You can also instruct PXF to ignore, or read, the file header. For example, to have PXF ignore the header, add the following to the CREATE EXTERNAL TABLE LOCATION URI:

&FILE_HEADER=IGNORE
PXF can read a CSV file with a header row only when the S3 Connector uses the Amazon S3 Select service to access the file on S3. PXF does not support reading a CSV file that includes a header row from any other external data store.

Specifying the CSV File Compression Type

If the CSV file is gzip- or bzip2-compressed, use the COMPRESSION_CODEC custom option in the LOCATION URI to identify the compression codec alias. For example:

&COMPRESSION_CODEC=gzip

Or,

&COMPRESSION_CODEC=bzip2

Creating the External Table

Use the following syntax to create a Greenplum Database external table that references a CSV file on S3 that you want PXF to access with the S3 Select service:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-file>
    ?PROFILE=s3:text&SERVER=<server_name>&S3_SELECT=ON|AUTO[&FILE_HEADER=IGNORE|USE][&COMPRESSION_CODEC=gzip|bzip2][&<other-custom-option>=<value>[...]]')
FORMAT 'CSV' [(delimiter '<delim_char>')];

For example, use the following command to have PXF always use S3 Select to access a gzip-compressed file on S3, where the field delimiter is a pipe (’|’) character and you want to read the header row:

CREATE EXTERNAL TABLE gzippedcsv_on_s3 ( LIKE table2 )
  LOCATION ('pxf://bucket/file.csv.gz?PROFILE=s3:text&SERVER=s3srvcfg&S3_SELECT=ON&FILE_HEADER=USE')
FORMAT 'CSV' (delimiter '|');