Configuring Connectors to Azure and Google Cloud Storage Object Stores (Optional)

You can use PXF to access Azure Data Lake, Azure Blob Storage, and Google Cloud Storage object stores. This topic describes how to configure the PXF connectors to these external data sources.

If you do not plan to use these PXF object store connectors, then you do not need to perform this procedure.

About Object Store Configuration

To access data in an object store, you must provide a server location and client credentials. When you configure a PXF object store connector, you add at least one named PXF server configuration for the connector as described in Configuring PXF Servers.

PXF provides a template configuration file for each object store connector. These template files are located in the $PXF_CONF/templates/ directory.

Azure Blob Storage Server Configuration

The template configuration file for Azure Blob Storage is $PXF_CONF/templates/wasbs-site.xml. When you configure an Azure Blob Storage server, you must provide the following server configuration properties, replacing the template values with your account name and account key:

| Property | Description | Value |
|----------|-------------|-------|
| fs.adl.oauth2.access.token.provider.type | The token type. | Must specify ClientCredential. |
| fs.azure.account.key.<YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME>.blob.core.windows.net | The Azure account key. | Replace with your account key. |
| fs.AbstractFileSystem.wasbs.impl | The file system class name. | Must specify org.apache.hadoop.fs.azure.Wasbs. |
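
The properties in the table above might be filled in as in the following sketch, which writes a wasbs-site.xml for a server configuration. The server name wasbs_demo, the storage account name myaccount, the placeholder key, and the scratch-directory fallback for PXF_CONF are all assumptions made for illustration. In practice you would more likely copy the template and edit it, as in the example procedure later in this topic.

```shell
#!/bin/sh
# Sketch only: every name and value below is a placeholder, not a real credential.
# Fall back to a scratch directory when PXF_CONF is unset so that the sketch
# can be tried without touching a real PXF installation.
PXF_CONF="${PXF_CONF:-$(mktemp -d)}"
SERVER_NAME="wasbs_demo"          # hypothetical server name
ACCOUNT_NAME="myaccount"          # hypothetical storage account name
ACCOUNT_KEY="YOUR_ACCOUNT_KEY"    # replace with your account key

SERVER_DIR="$PXF_CONF/servers/$SERVER_NAME"
mkdir -p "$SERVER_DIR"

# Write the three properties listed in the table. The account name is
# substituted into the middle of the account-key property name.
cat > "$SERVER_DIR/wasbs-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.adl.oauth2.access.token.provider.type</name>
        <value>ClientCredential</value>
    </property>
    <property>
        <name>fs.azure.account.key.${ACCOUNT_NAME}.blob.core.windows.net</name>
        <value>${ACCOUNT_KEY}</value>
    </property>
    <property>
        <name>fs.AbstractFileSystem.wasbs.impl</name>
        <value>org.apache.hadoop.fs.azure.Wasbs</value>
    </property>
</configuration>
EOF
```

The account name appears in the property name rather than in a value, which is why the sketch builds the name with a variable.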

Azure Data Lake Server Configuration

The template configuration file for Azure Data Lake is $PXF_CONF/templates/adl-site.xml. When you configure an Azure Data Lake server, you must provide the following server configuration properties and replace the template values with your credentials:

| Property | Description | Value |
|----------|-------------|-------|
| fs.adl.oauth2.access.token.provider.type | The type of token. | Must specify ClientCredential. |
| fs.adl.oauth2.refresh.url | The Azure endpoint to which to connect. | Your refresh URL. |
| fs.adl.oauth2.client.id | The Azure account client ID. | Your client ID (UUID). |
| fs.adl.oauth2.credential | The password for the Azure account client ID. | Your password. |
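
As a rough illustration of the four properties above, the following sketch writes an adl-site.xml whose credential values come from environment variables rather than being typed into the script. The server name adl_demo, the variable names ADL_REFRESH_URL, ADL_CLIENT_ID, and ADL_CREDENTIAL, their placeholder defaults, and the scratch-directory fallback for PXF_CONF are all hypothetical.

```shell
#!/bin/sh
# Sketch only: names and defaults are illustrative, not part of PXF.
# Fall back to a scratch directory when PXF_CONF is unset.
PXF_CONF="${PXF_CONF:-$(mktemp -d)}"
SERVER_NAME="adl_demo"            # hypothetical server name

# Hypothetical environment variables; the defaults are placeholders only.
REFRESH_URL="${ADL_REFRESH_URL:-YOUR_ADL_REFRESH_URL}"
CLIENT_ID="${ADL_CLIENT_ID:-YOUR_ADL_CLIENT_ID}"
CREDENTIAL="${ADL_CREDENTIAL:-YOUR_ADL_CREDENTIAL}"

SERVER_DIR="$PXF_CONF/servers/$SERVER_NAME"
mkdir -p "$SERVER_DIR"

# Values containing &, <, or > would need XML escaping first;
# this sketch does not handle that case.
cat > "$SERVER_DIR/adl-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.adl.oauth2.access.token.provider.type</name>
        <value>ClientCredential</value>
    </property>
    <property>
        <name>fs.adl.oauth2.refresh.url</name>
        <value>${REFRESH_URL}</value>
    </property>
    <property>
        <name>fs.adl.oauth2.client.id</name>
        <value>${CLIENT_ID}</value>
    </property>
    <property>
        <name>fs.adl.oauth2.credential</name>
        <value>${CREDENTIAL}</value>
    </property>
</configuration>
EOF
```

Reading the values from the environment keeps the password out of the script itself, although it still ends up in the generated file.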

Google Cloud Storage Server Configuration

The template configuration file for Google Cloud Storage is $PXF_CONF/templates/gs-site.xml. When you configure a Google Cloud Storage server, you must provide the following server configuration properties and replace the template values with your credentials:

| Property | Description | Value |
|----------|-------------|-------|
| google.cloud.auth.service.account.enable | Enable service account authorization. | Must specify true. |
| google.cloud.auth.service.account.json.keyfile | The Google Storage key file. | Path to your key file. |
| fs.AbstractFileSystem.gs.impl | The file system class name. | Must specify com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS. |

Example Server Configuration Procedure

Ensure that you have initialized PXF before you configure an object store connector server.

In this procedure, you name and add a PXF server configuration for the Google Cloud Storage (GCS) connector in the $PXF_CONF/servers directory on the Greenplum Database master host. You then use the pxf cluster sync command to sync the server configuration(s) to the Greenplum Database cluster.

  1. Log in to your Greenplum Database master host:

    $ ssh gpadmin@<gpmaster>
    
  2. Choose a name for the server. You will provide the name to end users that need to reference files in the object store.

  3. Create the $PXF_CONF/servers/<server_name> directory. For example, use the following command to create the server configuration directory for a Google Cloud Storage server named gs_public:

    gpadmin@gpmaster$ mkdir $PXF_CONF/servers/gs_public
    
  4. Copy the PXF template file for GCS to the server configuration directory. For example:

    gpadmin@gpmaster$ cp $PXF_CONF/templates/gs-site.xml $PXF_CONF/servers/gs_public/
    
  5. Open the copied server configuration file ($PXF_CONF/servers/gs_public/gs-site.xml in this example) in the editor of your choice, and provide appropriate property values for your environment. For example, if your Google Cloud Storage key file is located in /home/gpadmin/keys/gcs-account.key.json:

    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
        <property>
            <name>google.cloud.auth.service.account.enable</name>
            <value>true</value>
        </property>
        <property>
            <name>google.cloud.auth.service.account.json.keyfile</name>
            <value>/home/gpadmin/keys/gcs-account.key.json</value>
        </property>
        <property>
            <name>fs.AbstractFileSystem.gs.impl</name>
            <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
        </property>
    </configuration>
    
  6. Save your changes and exit the editor.

  7. Use the pxf cluster sync command to copy the new server configuration to the Greenplum Database cluster. For example:

    gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
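
Before running the sync in step 7, it can help to confirm that the key file named in the server configuration actually exists. The following is a minimal sketch of such a check. The function name check_gs_keyfile is hypothetical and not part of PXF, and the sed expression assumes the one-element-per-line layout used in the example in step 5.

```shell
# check_gs_keyfile SITE_FILE
# Hypothetical helper, not part of PXF. Prints the key file path named in
# SITE_FILE and returns 0 if that file is readable; returns 1 otherwise.
check_gs_keyfile() {
    site_file="$1"
    if [ ! -r "$site_file" ]; then
        echo "cannot read $site_file" >&2
        return 1
    fi
    # Find the keyfile <name> line, move to the next line, and extract the
    # text between <value> and </value>.
    keyfile=$(sed -n '/<name>google\.cloud\.auth\.service\.account\.json\.keyfile<\/name>/{n;s/.*<value>\(.*\)<\/value>.*/\1/p;}' "$site_file")
    if [ -z "$keyfile" ]; then
        echo "no key file property found in $site_file" >&2
        return 1
    fi
    if [ ! -r "$keyfile" ]; then
        echo "key file not readable: $keyfile" >&2
        return 1
    fi
    echo "$keyfile"
}
```

One way to use it would be to chain it with the sync, for example check_gs_keyfile $PXF_CONF/servers/gs_public/gs-site.xml && $GPHOME/pxf/bin/pxf cluster sync. The check covers only the host on which it runs.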