Accessing Azure, Google Cloud Storage, Minio, and S3 Object Stores with PXF
A newer version of this documentation is available. Use the version menu above to view the most up-to-date release of the Greenplum 5.x documentation.
PXF is installed with connectors to Azure Blob Storage, Azure Data Lake, Google Cloud Storage, Minio, and S3 object stores.
Prerequisites
Before working with object store data using PXF, ensure that:
- You have configured and initialized PXF on your Greenplum Database segment hosts, and PXF is running on each host. See Configuring PXF for additional information.
- You have configured the PXF Object Store Connectors that you plan to use on each Greenplum Database segment host. Refer to Configuring Connectors to Azure, Google Cloud Storage, Minio, and S3 Object Stores for instructions.
Connectors, Data Formats, and Profiles
The PXF object store connectors provide built-in profiles to support the following data formats:
- Text
- Avro
- JSON
- Parquet
- AvroSequenceFile
- SequenceFile
The PXF connectors to Azure, Google Cloud Storage, Minio, and S3 expose the following profiles to read, and in many cases write, these supported data formats:
Data Format | Azure Blob Storage | Azure Data Lake | Google Cloud Storage | S3 or Minio |
---|---|---|---|---|
delimited single line plain text | wasbs:text | adl:text | gs:text | s3:text |
delimited text with quoted linefeeds | wasbs:text:multi | adl:text:multi | gs:text:multi | s3:text:multi |
Avro | wasbs:avro | adl:avro | gs:avro | s3:avro |
JSON | wasbs:json | adl:json | gs:json | s3:json |
Parquet | wasbs:parquet | adl:parquet | gs:parquet | s3:parquet |
AvroSequenceFile | wasbs:AvroSequenceFile | adl:AvroSequenceFile | gs:AvroSequenceFile | s3:AvroSequenceFile |
SequenceFile | wasbs:SequenceFile | adl:SequenceFile | gs:SequenceFile | s3:SequenceFile |
You provide the profile name when you specify the pxf
protocol on a CREATE EXTERNAL TABLE
command to create a Greenplum Database external table that references a file or directory in the specific object store.
Sample CREATE EXTERNAL TABLE Commands
The following command creates an external table that references a text file on S3. It specifies the profile named s3:text
and the server configuration named s3srvcfg
:
CREATE EXTERNAL TABLE pxf_s3_text(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://S3_BUCKET/pxf_examples/pxf_s3_simple.txt?PROFILE=s3:text&SERVER=s3srvcfg') FORMAT 'TEXT' (delimiter=E',');
The following command creates an external table that references a text file on Azure Blob Storage. It specifies the profile named wasbs:text
and the server configuration named wasbssrvcfg
. You would provide the Azure Blob Storage container identifier and your Azure Blob Storage account name.
CREATE EXTERNAL TABLE pxf_wasbs_text(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://AZURE_CONTAINER@YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME.blob.core.windows.net/path/to/blob/file?PROFILE=wasbs:text&SERVER=wasbssrvcfg') FORMAT 'TEXT';
The following command creates an external table that references a text file on Azure Data Lake. It specifies the profile named adl:text
and the server configuration named adlsrvcfg
. You would provide your Azure Data Lake account name.
CREATE EXTERNAL TABLE pxf_adl_text(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://YOUR_ADL_ACCOUNT_NAME.azuredatalakestore.net/path/to/file?PROFILE=adl:text&SERVER=adlsrvcfg') FORMAT 'TEXT';
The following command creates an external table that references a JSON file on Google Cloud Storage. It specifies the profile named gs:json
and the server configuration named gcssrvcfg
:
CREATE EXTERNAL TABLE pxf_gsc_json(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://dir/subdir/file.json?PROFILE=gs:json&SERVER=gcssrvcfg') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
Overriding the S3 Server Configuration
If you are accessing an S3 object store, you can override the S3 server configuration by directly specifying the S3 access ID and secret key via these custom options in the CREATE EXTERNAL TABLE
LOCATION
clause:
Custom Option | Value Description |
---|---|
accesskey | The AWS account access key ID. |
secretkey | The secret key associated with the AWS access key ID. |
For example:
CREATE EXTERNAL TABLE pxf_ext_tbl(name text, orders int) LOCATION ('pxf://S3_BUCKET/dir/file.txt?PROFILE=s3:text&SERVER=s3srvcfg&accesskey=YOURKEY&secretkey=YOURSECRET') FORMAT 'TEXT' (delimiter=E',');
PXF does not support overriding Azure, Google Cloud Storage, and Minio server credentials in this manner at this time.