Accessing Azure, Google Cloud Storage, Minio, and S3 Object Stores with PXF

PXF is installed with connectors to Azure Blob Storage, Azure Data Lake, Google Cloud Storage, Minio, and S3 object stores.

Prerequisites

Before working with object store data using PXF, ensure that:

  • PXF is configured and running on each Greenplum Database host.
  • You have configured a PXF server for each object store connector that you plan to use, and you know the server configuration name(s).

Connectors, Data Formats, and Profiles

The PXF object store connectors provide built-in profiles to support the following data formats:

  • Text
  • CSV
  • Avro
  • JSON
  • ORC
  • Parquet
  • AvroSequenceFile
  • SequenceFile

The PXF connectors to Azure expose the following profiles to read, and in many cases write, these supported data formats:

Data Format                                                 | Azure Blob Storage     | Azure Data Lake      | Supported Operations
------------------------------------------------------------|------------------------|----------------------|---------------------
delimited single line plain text                            | wasbs:text             | adl:text             | Read, Write
delimited single line comma-separated values of plain text  | wasbs:csv              | adl:csv              | Read, Write
delimited text with quoted linefeeds                        | wasbs:text:multi       | adl:text:multi       | Read
Avro                                                        | wasbs:avro             | adl:avro             | Read, Write
JSON                                                        | wasbs:json             | adl:json             | Read
ORC                                                         | wasbs:orc              | adl:orc              | Read
Parquet                                                     | wasbs:parquet          | adl:parquet          | Read, Write
AvroSequenceFile                                            | wasbs:AvroSequenceFile | adl:AvroSequenceFile | Read, Write
SequenceFile                                                | wasbs:SequenceFile     | adl:SequenceFile     | Read, Write

Similarly, the PXF connectors to Google Cloud Storage, Minio, and S3 expose these profiles:

Data Format                                                 | Google Cloud Storage | S3 or Minio         | Supported Operations
------------------------------------------------------------|----------------------|---------------------|---------------------
delimited single line plain text                            | gs:text              | s3:text             | Read, Write
delimited single line comma-separated values of plain text  | gs:csv               | s3:csv              | Read, Write
delimited text with quoted linefeeds                        | gs:text:multi        | s3:text:multi       | Read
Avro                                                        | gs:avro              | s3:avro             | Read, Write
JSON                                                        | gs:json              | s3:json             | Read
ORC                                                         | gs:orc               | s3:orc              | Read
Parquet                                                     | gs:parquet           | s3:parquet          | Read, Write
AvroSequenceFile                                            | gs:AvroSequenceFile  | s3:AvroSequenceFile | Read, Write
SequenceFile                                                | gs:SequenceFile      | s3:SequenceFile     | Read, Write

You provide the profile name when you specify the pxf protocol on a CREATE EXTERNAL TABLE command to create a Greenplum Database external table that references a file or directory in the specific object store.
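The general form of the LOCATION URI combines the object store path with the PROFILE and SERVER options. The angle-bracket placeholders below are illustrative, not literal values:

LOCATION ('pxf://<path-to-file-or-directory>?PROFILE=<profile_name>&SERVER=<server_name>')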

Sample CREATE EXTERNAL TABLE Commands

When you create an external table that references a file or directory in an object store, you must specify a SERVER=<server_name> option in the LOCATION URI that identifies the server configuration for that object store.

The following command creates an external table that references a text file on S3. It specifies the profile named s3:text and the server configuration named s3srvcfg:

CREATE EXTERNAL TABLE pxf_s3_text(location text, month text, num_orders int, total_sales float8)
  LOCATION ('pxf://S3_BUCKET/pxf_examples/pxf_s3_simple.txt?PROFILE=s3:text&SERVER=s3srvcfg')
FORMAT 'TEXT' (delimiter=E',');
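
Once defined, you query the external table like any other Greenplum Database table. A minimal query sketch, assuming the sample file contains comma-separated rows that match the column definitions (the predicate is hypothetical):

SELECT location, total_sales FROM pxf_s3_text WHERE num_orders > 100;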

The following command creates an external table that references a text file on Azure Blob Storage. It specifies the profile named wasbs:text and the server configuration named wasbssrvcfg. You would provide the Azure Blob Storage container identifier and your Azure Blob Storage account name.

CREATE EXTERNAL TABLE pxf_wasbs_text(location text, month text, num_orders int, total_sales float8)
  LOCATION ('pxf://AZURE_CONTAINER@YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME.blob.core.windows.net/path/to/blob/file?PROFILE=wasbs:text&SERVER=wasbssrvcfg')
FORMAT 'TEXT';
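
Because the wasbs:text profile also supports the Write operation, you can create a writable external table with the same profile and server configuration, and then INSERT into it. This is a sketch only; the table name and directory path are illustrative:

CREATE WRITABLE EXTERNAL TABLE pxf_wasbs_text_w(location text, month text, num_orders int, total_sales float8)
  LOCATION ('pxf://AZURE_CONTAINER@YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME.blob.core.windows.net/path/to/dir?PROFILE=wasbs:text&SERVER=wasbssrvcfg')
FORMAT 'TEXT';

INSERT INTO pxf_wasbs_text_w SELECT * FROM pxf_wasbs_text;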

The following command creates an external table that references a text file on Azure Data Lake. It specifies the profile named adl:text and the server configuration named adlsrvcfg. You would provide your Azure Data Lake account name.

CREATE EXTERNAL TABLE pxf_adl_text(location text, month text, num_orders int, total_sales float8)
  LOCATION ('pxf://YOUR_ADL_ACCOUNT_NAME.azuredatalakestore.net/path/to/file?PROFILE=adl:text&SERVER=adlsrvcfg')
FORMAT 'TEXT';

The following command creates an external table that references a JSON file on Google Cloud Storage. It specifies the profile named gs:json and the server configuration named gcssrvcfg:

CREATE EXTERNAL TABLE pxf_gcs_json(location text, month text, num_orders int, total_sales float8)
  LOCATION ('pxf://dir/subdir/file.json?PROFILE=gs:json&SERVER=gcssrvcfg')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
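
As with the other examples, you read the data with a standard query. This sketch assumes that the keys in each JSON record match the column names defined above:

SELECT * FROM pxf_gcs_json;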