Configuring PXF Hadoop Connectors (Optional)

PXF is compatible with Cloudera, Hortonworks Data Platform, MapR, and generic Apache Hadoop distributions. This topic describes how configure the PXF Hadoop, Hive, and HBase connectors.

If you do not want to use the Hadoop-related PXF connectors, then you do not need to perform this procedure.

Prerequisites

Configuring PXF Hadoop connectors involves copying configuration files from your Hadoop cluster to the Greenplum Database master host. If you are using the MapR Hadoop distribution, you must also copy certain JAR files to the master host. Before you configure the PXF Hadoop connectors, ensure that you can copy files from hosts in your Hadoop cluster to the Greenplum Database master.

Procedure

Perform the following procedure to configure the desired PXF Hadoop-related connectors on the Greenplum Database master host. After you configure the connectors, you will use the pxf cluster sync command to copy the PXF configuration to the Greenplum Database cluster.

In this procedure, you use the default, or create a new, PXF server configuration. You copy Hadoop configuration files to the server configuration directory on the Greenplum Database master host. You may also copy libraries to $PXF_CONF/lib for MapR support. You then synchronize the PXF configuration on the master host to the standby master and segment hosts. (PXF creates the$PXF_CONF/* directories when you run pxf cluster init.)

  1. Log in to your Greenplum Database master node:

    $ ssh gpadmin@<gpmaster>
    
  2. Identify the name of your PXF Hadoop server configuration. If your Hadoop cluster is Kerberized, you must use the default PXF server.

  3. If you are not using the default PXF server, create the $PXF_HOME/servers/<server_name> directory. For example, use the following command to create a Hadoop server configuration named hdp3:

    gpadmin@gpmaster$ mkdir $PXF_CONF/servers/hdp3
    
  4. Change to the server directory. For example:

    gpadmin@gpmaster$ cd $PXF_CONF/servers/default
    

    Or,

    gpadmin@gpmaster$ cd $PXF_CONF/servers/hdp3
    
  5. PXF requires information from core-site.xml and other Hadoop configuration files. Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from your Hadoop cluster NameNode host to the current host using your tool of choice. Your file paths may differ based on the Hadoop distribution in use. For example, these commands use scp to copy the files:

    gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
    gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
    gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
    gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .
    
  6. If you plan to use the PXF Hive connector to access Hive table data, similarly copy the Hive configuration to the Greenplum Database master host. For example:

    gpadmin@gpmaster$ scp hiveuser@hivehost:/etc/hive/conf/hive-site.xml .
    
  7. If you plan to use the PXF HBase connector to access HBase table data, similarly copy the HBase configuration to the Greenplum Database master host. For example:

    gpadmin@gpmaster$ scp hbaseuser@hbasehost:/etc/hbase/conf/hbase-site.xml .
    
  8. If you are using PXF with the MapR Hadoop distribution, you must copy certain JAR files from your MapR cluster to the Greenplum Database master host. (Your file paths may differ based on the version of MapR in use.) For example, these commands use scp to copy the files:

    gpadmin@gpmaster$ cd $PXF_CONF/lib
    gpadmin@gpmaster$ scp mapruser@maprhost:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/maprfs-5.2.2-mapr.jar .
    gpadmin@gpmaster$ scp mapruser@maprhost:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/hadoop-auth-2.7.0-mapr-1707.jar .
    gpadmin@gpmaster$ scp mapruser@maprhost:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0-mapr-1707.jar .
    
  9. Synchronize the PXF configuration to the Greenplum Database cluster. For example:

    gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
    
  10. PXF accesses Hadoop services on behalf of Greenplum Database end users. By default, PXF tries to access HDFS, Hive, and HBase using the identity of the Greenplum Database user account that logs into Greenplum Database. In order to support this functionality, you must configure proxy settings for Hadoop, as well as for Hive and HBase if you intend to use those PXF connectors. Follow procedures in Configuring User Impersonation and Proxying to configure user impersonation and proxying for Hadoop services, or to turn off PXF user impersonation.

  11. Grant read permission to the HDFS files and directories that will be accessed as external tables in Greenplum Database. If user impersonation is enabled (the default), you must grant this permission to each Greenplum Database user/role name that will use external tables that reference the HDFS files. If user impersonation is not enabled, you must grant this permission to the gpadmin user.

  12. If your Hadoop cluster is secured with Kerberos, you must configure PXF and generate Kerberos principals and keytabs for each segment host as described in Configuring PXF for Secure HDFS.

Updating Hadoop Configuration

If you update your Hadoop, Hive, or HBase configuration while the PXF service is running, you must copy the updated configuration to the $PXF_CONF/servers/<server_name> directory and re-sync the PXF configuration to your Greenplum Database cluster. For example:

gpadmin@gpmaster$ cd $PXF_CONF/servers/<server_name>
gpadmin@gpmaster$ scp hiveuser@hivehost:/etc/hive/conf/hive-site.xml .
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync