Configuring Connectors to Azure, Google Cloud Storage, Minio, and S3 Object Stores (Optional)

You can use PXF to access Azure Data Lake, Azure Blob Storage, Google Cloud Storage, Minio, and S3 object stores. This topic describes how to configure the PXF connectors to these external data sources.

If you do not plan to use the PXF object store connectors, then you do not need to perform this procedure.

About Object Store Configuration

To access data in an object store, you must provide a server location and client credentials. PXF provides a template configuration file for each Hadoop and object store connector. These server template configuration files, located in the $PXF_CONF/templates/ directory, identify the minimum set of properties that you must configure to use the connector.

gpadmin@gpmaster$ ls $PXF_CONF/templates
adl-site.xml   hbase-site.xml  mapred-site.xml  wasbs-site.xml
core-site.xml  hdfs-site.xml   minio-site.xml   yarn-site.xml
gs-site.xml    hive-site.xml   s3-site.xml

For example, the contents of the s3-site.xml template file follow:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.s3a.access.key</name>
        <value>YOUR_AWS_ACCESS_KEY_ID</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
    </property>
    <property>
        <name>fs.s3a.fast.upload</name>
        <value>true</value>
    </property>
</configuration>

Note: You specify object store credentials to PXF in clear text in configuration files.
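
Because these files hold credentials in plain text, consider restricting access to the server configuration directories. For example, a minimal precaution (assuming gpadmin owns the files) is to limit read access to that user:

gpadmin@gpmaster$ chmod 700 $PXF_CONF/servers/<server_name>
gpadmin@gpmaster$ chmod 600 $PXF_CONF/servers/<server_name>/*.xml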

When you configure a PXF object store connector, you add at least one named PXF server configuration for the connector as follows (a condensed shell sketch of these steps appears after the list):

  1. Choose a name for the server configuration.
  2. Create the directory $PXF_CONF/servers/<server_name>.
  3. Copy the PXF template configuration file corresponding to the object store to the new server directory.
  4. Fill in appropriate values for the properties in the template file.
  5. Add additional properties and values if required for your environment.
  6. Synchronize the server configuration to each Greenplum Database segment host.
  7. Publish the PXF server names to your Greenplum Database end users as appropriate.
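
For example, steps 2, 3, and 6 for a hypothetical S3 server named s3srvcfg reduce to the following commands; step 4, providing your property values, happens in the editor of your choice:

gpadmin@gpmaster$ mkdir $PXF_CONF/servers/s3srvcfg
gpadmin@gpmaster$ cp $PXF_CONF/templates/s3-site.xml $PXF_CONF/servers/s3srvcfg/
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync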

A Greenplum Database user specifies the server name in the SERVER option of the CREATE EXTERNAL TABLE LOCATION clause to access the object store. For example:

CREATE EXTERNAL TABLE pxf_ext_tbl(name text, orders int)
  LOCATION ('pxf://BUCKET/dir/file.txt?PROFILE=s3:text&SERVER=s3srvcfg')
FORMAT 'TEXT' (delimiter=E',');

A Greenplum Database user who queries or writes to an external table that specifies a server name accesses the object store with the credentials configured for that server.
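
For example, the following query reads /dir/file.txt from BUCKET with the credentials configured for the s3srvcfg server:

SELECT name, orders FROM pxf_ext_tbl;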

Azure Blob Storage Server Configuration

The template configuration file for Azure Blob Storage is $PXF_CONF/templates/wasbs-site.xml. When you configure an Azure Blob Storage server, you must provide the following server configuration properties, replacing the template values with your account name and account key:

Property | Description | Value
fs.azure.account.key.<YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME>.blob.core.windows.net | The Azure account key. | Replace <YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME> in the property name with your account name, and specify your account key as the value.
fs.AbstractFileSystem.wasbs.impl | The file system class name. | Must specify org.apache.hadoop.fs.azure.Wasbs.
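
For example, a completed wasbs-site.xml for a hypothetical account named myblobacct follows; the account key value is a placeholder:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.azure.account.key.myblobacct.blob.core.windows.net</name>
        <value>YOUR_AZURE_BLOB_STORAGE_ACCOUNT_KEY</value>
    </property>
    <property>
        <name>fs.AbstractFileSystem.wasbs.impl</name>
        <value>org.apache.hadoop.fs.azure.Wasbs</value>
    </property>
</configuration>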

Azure Data Lake Server Configuration

The template configuration file for Azure Data Lake is $PXF_CONF/templates/adl-site.xml. When you configure an Azure Data Lake server, you must provide the following server configuration properties and replace the template values with your credentials:

Property | Description | Value
dfs.adls.oauth2.access.token.provider.type | The type of token. | Must specify ClientCredential.
dfs.adls.oauth2.refresh.url | The Azure endpoint to which to connect. | Your refresh URL.
dfs.adls.oauth2.client.id | The Azure account client ID. | Your client ID (UUID).
dfs.adls.oauth2.credential | The password for the Azure account client ID. | Your password.
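
For example, a completed adl-site.xml follows; the refresh URL, client ID, and credential values are placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>dfs.adls.oauth2.access.token.provider.type</name>
        <value>ClientCredential</value>
    </property>
    <property>
        <name>dfs.adls.oauth2.refresh.url</name>
        <value>YOUR_REFRESH_URL</value>
    </property>
    <property>
        <name>dfs.adls.oauth2.client.id</name>
        <value>YOUR_CLIENT_ID</value>
    </property>
    <property>
        <name>dfs.adls.oauth2.credential</name>
        <value>YOUR_CREDENTIAL</value>
    </property>
</configuration>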

Google Cloud Storage Server Configuration

The template configuration file for Google Cloud Storage is $PXF_CONF/templates/gs-site.xml. When you configure a Google Cloud Storage server, you must provide the following server configuration properties and replace the template values with your credentials:

Property | Description | Value
google.cloud.auth.service.account.enable | Enable service account authorization. | Must specify true.
google.cloud.auth.service.account.json.keyfile | The Google Storage key file. | Path to your key file.
fs.AbstractFileSystem.gs.impl | The file system class name. | Must specify com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS.

Minio Server Configuration

The template configuration file for Minio is $PXF_CONF/templates/minio-site.xml. When you configure a Minio server, you must provide the following server configuration properties and replace the template values with your credentials:

Property | Description | Value
fs.s3a.endpoint | The Minio S3 endpoint to which to connect. | Your endpoint.
fs.s3a.access.key | The Minio account access key ID. | Your access key.
fs.s3a.secret.key | The secret key associated with the Minio access key ID. | Your secret key.
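
For example, a completed minio-site.xml for a hypothetical Minio deployment at minio.example.com:9000 follows; the access and secret keys are placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.s3a.endpoint</name>
        <value>http://minio.example.com:9000</value>
    </property>
    <property>
        <name>fs.s3a.access.key</name>
        <value>YOUR_MINIO_ACCESS_KEY</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>YOUR_MINIO_SECRET_KEY</value>
    </property>
</configuration>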

S3 Server Configuration

The template configuration file for S3 is $PXF_CONF/templates/s3-site.xml. When you configure an S3 server, you must provide the following server configuration properties and replace the template values with your credentials:

Property | Description | Value
fs.s3a.access.key | The AWS account access key ID. | Your access key.
fs.s3a.secret.key | The secret key associated with the AWS access key ID. | Your secret key.

If required, fine-tune PXF S3 connectivity by specifying properties identified in the S3A section of the Hadoop-AWS module documentation in your s3-site.xml server configuration file.
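
For example, to raise the maximum number of simultaneous connections that the S3A client opens, you might add the Hadoop-AWS fs.s3a.connection.maximum property to s3-site.xml; the value shown here is illustrative, not a recommendation:

<property>
    <name>fs.s3a.connection.maximum</name>
    <value>100</value>
</property>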

You can override an S3 server configuration by directly specifying the S3 access ID and secret key via custom options in the CREATE EXTERNAL TABLE command LOCATION clause. Refer to Overriding the S3 Server Configuration for additional information.
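
For example, an external table definition that carries its own credentials might look like the following, using the custom option names described in that topic; the key values shown are placeholders:

CREATE EXTERNAL TABLE pxf_ext_tbl_inline(name text, orders int)
  LOCATION ('pxf://BUCKET/dir/file.txt?PROFILE=s3:text&SERVER=s3srvcfg&accesskey=YOUR_KEY&secretkey=YOUR_SECRET')
FORMAT 'TEXT' (delimiter=E',');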

Configuring S3 Server-Side Encryption

PXF supports Amazon Web Services S3 Server-Side Encryption (SSE) for S3 files that you access with readable and writable Greenplum Database external tables that specify the pxf protocol and an s3:* profile. AWS S3 server-side encryption protects your data at rest: S3 encrypts your object data as it writes it to disk, and transparently decrypts the data for you when you access it.

PXF supports the following AWS SSE encryption key management schemes:

  • SSE with S3-Managed Keys (SSE-S3) - Amazon manages the data and master encryption keys.
  • SSE with Key Management Service Managed Keys (SSE-KMS) - Amazon manages the data key, and you manage the encryption key in AWS KMS.
  • SSE with Customer-Provided Keys (SSE-C) - You set and manage the encryption key.

Your S3 access key and secret key govern your access to all S3 bucket objects, whether the data is encrypted or not.

S3 transparently decrypts data during read operations on encrypted files that you access via a readable external table created with the pxf protocol and an s3:* profile. No additional configuration is required to read encrypted data.

To encrypt data that you write to S3 via this type of external table, you have two options:

  • Configure the default SSE encryption key management scheme on a per-S3-bucket basis via the AWS console or command line tools (recommended).
  • Configure SSE encryption options in your PXF S3 server s3-site.xml configuration file.

Configuring SSE via an S3 Bucket Policy (Recommended)

You can create one or more S3 bucket policies that identify the objects that you want to encrypt, the encryption key management scheme, and the write actions permitted on those objects. Refer to Protecting Data Using Server-Side Encryption in the AWS S3 documentation for more information about the SSE encryption key management schemes. How Do I Enable Default Encryption for an S3 Bucket? describes how to set default encryption bucket policies.
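
For example, assuming you use the AWS CLI, you might enable default SSE-S3 encryption on a bucket as follows; the bucket name is a placeholder:

$ aws s3api put-bucket-encryption --bucket YOUR_BUCKET_NAME \
    --server-side-encryption-configuration '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'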

Specifying SSE Options in a PXF S3 Server Configuration

You must include certain properties in s3-site.xml to configure server-side encryption in a PXF S3 server configuration. The properties and values that you add to the file are dependent upon the SSE encryption key management scheme.

SSE-S3

To enable SSE-S3 on any file that you write to any S3 bucket, set the following encryption algorithm property and value in the s3-site.xml file:

<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>AES256</value>
</property>

To enable SSE-S3 for a specific S3 bucket, use the property name variant that includes the bucket name. For example:

<property>
  <name>fs.s3a.bucket.YOUR_BUCKET1_NAME.server-side-encryption-algorithm</name>
  <value>AES256</value>
</property>

Replace YOUR_BUCKET1_NAME with the name of the S3 bucket.

SSE-KMS

To enable SSE-KMS on any file that you write to any S3 bucket, set both the encryption algorithm and encryption key ID. To set these properties in the s3-site.xml file:

<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>SSE-KMS</value>
</property>
<property>
  <name>fs.s3a.server-side-encryption.key</name>
  <value>YOUR_AWS_SSE_KMS_KEY_ARN</value>
</property>

Replace YOUR_AWS_SSE_KMS_KEY_ARN with your KMS key resource name (ARN). If you do not specify an encryption key, the default key defined in AWS KMS is used. Example KMS key ARN: arn:aws:kms:us-west-2:123456789012:key/1a23b456-7890-12cc-d345-6ef7890g12f3.

Note: Be sure to create the bucket and the KMS key in the same AWS Region.

To enable SSE-KMS for a specific S3 bucket, use property name variants that include the bucket name. For example:

<property>
  <name>fs.s3a.bucket.YOUR_BUCKET2_NAME.server-side-encryption-algorithm</name>
  <value>SSE-KMS</value>
</property>
<property>
  <name>fs.s3a.bucket.YOUR_BUCKET2_NAME.server-side-encryption.key</name>
  <value>YOUR_AWS_SSE_KMS_KEY_ARN</value>
</property>

Replace YOUR_BUCKET2_NAME with the name of the S3 bucket.

SSE-C

To enable SSE-C on any file that you write to any S3 bucket, set both the encryption algorithm and the encryption key (base-64 encoded). All clients must share the same key.
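
A suitable key can be generated, for example, with OpenSSL; this sketch assumes a 256-bit AES key:

$ openssl rand -base64 32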

To set these properties in the s3-site.xml file:

<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>SSE-C</value>
</property>
<property>
  <name>fs.s3a.server-side-encryption.key</name>
  <value>YOUR_BASE64-ENCODED_ENCRYPTION_KEY</value>
</property>

To enable SSE-C for a specific S3 bucket, use the property name variants that include the bucket name as described in the SSE-KMS example.

Example Configuration Procedure

Ensure that you have initialized PXF before you configure an object store connector.

In this procedure, you name and add a PXF server configuration in the $PXF_CONF/servers directory on the Greenplum Database master host for each object store connector that you plan to use. You then use the pxf cluster sync command to synchronize the server configurations to all segment hosts.

  1. Log in to your Greenplum Database master node:

    $ ssh gpadmin@<gpmaster>
    
  2. PXF includes connectors to the Azure Data Lake, Azure Blob Storage, Google Cloud Storage, Minio, and S3 object stores. Identify the PXF object store connectors that you want to configure.

  3. For each object store connector that you configure:

    1. Choose a name for the server. You will provide this name to the end users who need to reference files in the object store.

      Note: The server name default is reserved.

    2. Create the $PXF_CONF/servers/<server_name> directory. For example, use the following command if you are creating a server configuration for Google Cloud Storage and you want to name the server gs_public:

      gpadmin@gpmaster$ mkdir $PXF_CONF/servers/gs_public
      
    3. Copy the PXF template file for the object store to the server configuration directory. For example:

      gpadmin@gpmaster$ cp $PXF_CONF/templates/gs-site.xml $PXF_CONF/servers/gs_public/
      
    4. Open the template server configuration file in the editor of your choice, and provide appropriate property values for your environment. For example, if your Google Cloud Storage key file is located in /home/gpadmin/keys/gcs-account.key.json:

      <?xml version="1.0" encoding="UTF-8"?>
      <configuration>
          <property>
              <name>google.cloud.auth.service.account.enable</name>
              <value>true</value>
          </property>
          <property>
              <name>google.cloud.auth.service.account.json.keyfile</name>
              <value>/home/gpadmin/keys/gcs-account.key.json</value>
          </property>
          <property>
              <name>fs.AbstractFileSystem.gs.impl</name>
              <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
          </property>
      </configuration>
      
    5. Save your changes and exit the editor.

    6. Repeat these substeps (Step 3) for each additional object store connector that you want to configure.

  4. Use the pxf cluster sync command to copy the new server configurations to each Greenplum Database segment host. For example:

    gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
    

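After the sync completes, you can verify a new server configuration by creating and querying an external table that references it. For example, the following statements exercise the gs_public server configured above; the bucket and file names are hypothetical, and the gs:text profile assumes comma-delimited text data:

CREATE EXTERNAL TABLE pxf_gs_verify(name text, orders int)
  LOCATION ('pxf://YOUR_GCS_BUCKET/dir/file.txt?PROFILE=gs:text&SERVER=gs_public')
FORMAT 'TEXT' (delimiter=E',');

SELECT * FROM pxf_gs_verify;
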
Adding or Updating Object Store Configuration

If you add or update the object store server configuration on the Greenplum Database master host, you must re-sync the PXF configuration to the Greenplum Database cluster:

gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync