Reading ORC Data from an Object Store

The PXF object store connectors support reading ORC-format data. This section describes how to use PXF to access ORC data in an object store, including how to create and query an external table that references a file in the store.

Note: Accessing ORC-format data from an object store is very similar to accessing ORC-format data in HDFS. This topic identifies object store-specific information required to read ORC data, and links to the PXF Hadoop ORC documentation where appropriate for common information.

Prerequisites

Ensure that you have met the PXF Object Store Prerequisites before you attempt to read data from an object store.

Data Type Mapping

Refer to Data Type Mapping in the PXF Hadoop ORC documentation for a description of the mapping between Greenplum Database and ORC data types.

Creating the External Table

Use the <objstore>:orc profile to read ORC-format files from an object store. PXF supports the following <objstore> profile prefixes:

Object Store Profile Prefix
Azure Blob Storage wasbs
Azure Data Lake adl
Google Cloud Storage gs
Minio s3
S3 s3

The following syntax creates a Greenplum Database readable external table that references an ORC-format file:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-file>?PROFILE=<objstore>:orc&SERVER=<server_name>[&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

The specific keywords and values used in the Greenplum Database CREATE EXTERNAL TABLE command are described in the table below.

Keyword Value
<path‑to‑file> The path to the directory or file in the object store. When the <server_name> configuration includes a pxf.fs.basePath property setting, PXF considers <path‑to‑file> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. <path‑to‑file> must not specify a relative path nor include the dollar sign ($) character.
PROFILE=<objstore>:orc The PROFILE keyword must identify the specific object store. For example, s3:orc.
SERVER=<server_name> The named server configuration that PXF uses to access the data.
<custom‑option>=<value> ORC supports customs options as described in the PXF Hadoop ORC documentation
FORMAT ‘CUSTOM’ Use FORMAT 'CUSTOM' with the <objstore>:orc profile. The CUSTOM FORMAT requires that you specify (FORMATTER='pxfwritable_import').

If you are accessing an S3 object store, you can provide S3 credentials via custom options in the CREATE EXTERNAL TABLE command as described in Overriding the S3 Server Configuration with DDL.

Example

Refer to Example: Reading an ORC File on HDFS in the PXF Hadoop ORC documentation for an example. Modifications that you must make to run the example with an object store include:

  • Copying the file to the object store instead of HDFS. For example, to copy the file to S3:

    $ aws s3 cp /tmp/sampledata.orc s3://BUCKET/pxf_examples/
    
  • Using the CREATE EXTERNAL TABLE syntax and LOCATION keywords and settings described above. For example, if your server name is s3srvcfg:

    CREATE EXTERNAL TABLE sample_orc( location TEXT, month TEXT, num_orders INTEGER, total_sales NUMERIC(10,2)
      LOCATION('pxf://BUCKET/pxf_examples/sampledata.orc?PROFILE=s3:orc&SERVER=s3srvcfg')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');