Configuring PXF Hadoop Connectors (Optional)

PXF is compatible with Cloudera, Hortonworks Data Platform, and generic Apache Hadoop distributions. This topic describes how to configure the PXF Hadoop, Hive, and HBase connectors.

If you do not plan to use the Hadoop-related PXF connectors, or you already have Hadoop client installations on your Greenplum Database segment hosts, you need not perform this procedure.

Prerequisites

PXF bundles all of the JAR files on which it depends, including those for Hadoop services, and loads these JARs at runtime. Configuring the PXF Hadoop connectors involves copying configuration files from your Hadoop cluster to each Greenplum Database segment host. Before you configure the PXF Hadoop, Hive, and HBase connectors, ensure that you have:

  • scp access to hosts running the HDFS, Hive, and HBase services in your Hadoop cluster.

In this procedure, you copy Hadoop configuration files to /etc/<hadoop-service>/conf directories on each Greenplum Database segment host. You must create these directories if they do not exist, and assign read and write access permissions to the gpadmin user. You need only create the directories for the Hadoop services that you plan to use. For example, to create the directories on a segment host and assign ownership to gpadmin:

root@seghost$ mkdir -p /etc/hadoop/conf /etc/hive/conf /etc/hbase/conf
root@seghost$ chown -R gpadmin:gpadmin /etc/hadoop/conf /etc/hive/conf /etc/hbase/conf
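
If password-less root ssh from the Greenplum Database master host to each segment host is configured, you can create the directories on all segment hosts in one step with the gpssh utility. The following is a sketch only; it assumes the Greenplum environment is sourced so that gpssh is on the PATH, and a seghostfile listing your segment hosts (described in the procedure below). Otherwise, run the commands on each segment host individually as shown above.

# sketch: seghostfile is assumed to list your segment hosts
root@gpmaster$ gpssh -f seghostfile -e 'mkdir -p /etc/hadoop/conf /etc/hive/conf /etc/hbase/conf'
root@gpmaster$ gpssh -f seghostfile -e 'chown -R gpadmin:gpadmin /etc/hadoop/conf /etc/hive/conf /etc/hbase/conf'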

Procedure

Perform the following procedure to configure the desired PXF Hadoop-related connectors on each segment host in your Greenplum Database cluster.

Where possible, you will use the gpssh and gpscp utilities to run a command on multiple hosts. The seghostfile referenced in the commands below is a text file that lists each Greenplum Database segment host, one host name per line.

  1. Log in to your Greenplum Database master node and set up the environment:

    $ ssh gpadmin@<gpmaster>
    gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh
    
  2. PXF requires information from core-site.xml and other Hadoop configuration files. Copy relevant configuration from your Hadoop cluster to each Greenplum Database segment host.

    1. Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from your Hadoop cluster NameNode host to the current host. For example:

      gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
      gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
      gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
      gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .
      
    2. Next, copy these Hadoop configuration files to each Greenplum Database segment host. For example:

      gpadmin@gpmaster$ gpscp -v -f seghostfile core-site.xml =:/etc/hadoop/conf/core-site.xml
      gpadmin@gpmaster$ gpscp -v -f seghostfile hdfs-site.xml =:/etc/hadoop/conf/hdfs-site.xml
      gpadmin@gpmaster$ gpscp -v -f seghostfile mapred-site.xml =:/etc/hadoop/conf/mapred-site.xml
      gpadmin@gpmaster$ gpscp -v -f seghostfile yarn-site.xml =:/etc/hadoop/conf/yarn-site.xml
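
    Optionally, verify that the files are now in place on every segment host; for example:

      gpadmin@gpmaster$ gpssh -f seghostfile -e 'ls -l /etc/hadoop/conf'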
      
  3. If you plan to use the PXF Hive connector to access Hive table data, similarly copy Hive configuration to each Greenplum Database segment host.

    1. Copy the hive-site.xml Hive configuration file from one of the hosts on which your Hive service is running to the current host. For example:

      gpadmin@gpmaster$ scp hiveuser@hivehost:/etc/hive/conf/hive-site.xml .
      
    2. Next, copy the hive-site.xml configuration file to each Greenplum Database segment host. For example:

      gpadmin@gpmaster$ gpscp -v -f seghostfile hive-site.xml =:/etc/hive/conf/hive-site.xml
      
  4. If you plan to use the PXF HBase connector to access HBase table data, similarly copy HBase configuration to each Greenplum Database segment host.

    1. Copy the hbase-site.xml HBase configuration file from one of the hosts on which your HBase service is running to the current host. For example:

      gpadmin@gpmaster$ scp hbaseuser@hbasehost:/etc/hbase/conf/hbase-site.xml .
      
    2. Next, copy the hbase-site.xml configuration file to each Greenplum Database segment host. For example:

      gpadmin@gpmaster$ gpscp -v -f seghostfile hbase-site.xml =:/etc/hbase/conf/hbase-site.xml
      
  5. PXF accesses Hadoop services on behalf of Greenplum Database end users. By default, PXF tries to access HDFS, Hive, and HBase using the identity of the Greenplum Database user account that logs into Greenplum Database. In order to support this functionality, you must configure proxy settings for Hadoop, as well as for Hive and HBase if you intend to use those PXF connectors. Follow procedures in Configuring User Impersonation and Proxying to configure user impersonation and proxying for Hadoop services, or to turn off PXF user impersonation.
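
     For reference, Hadoop proxying is controlled by hadoop.proxyuser properties in the Hadoop cluster's core-site.xml. The snippet below is only an illustration; it assumes that the PXF service runs as the gpadmin operating system user. Confirm the exact user name and the recommended host and group values in Configuring User Impersonation and Proxying before changing your Hadoop configuration:

    <!-- illustration only: assumes PXF runs as gpadmin; restrict hosts/groups as appropriate -->
    <property>
        <name>hadoop.proxyuser.gpadmin.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.gpadmin.groups</name>
        <value>*</value>
    </property>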

  6. Grant read permission to the HDFS files and directories that will be accessed as external tables in Greenplum Database. If user impersonation is enabled (the default), you must grant this permission to each Greenplum Database user/role name that will use external tables that reference the HDFS files. If user impersonation is not enabled, you must grant this permission to the gpadmin user.
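
     For example, you could grant read access with hdfs dfs commands run on a Hadoop host. The path /data/pxf_examples and the user name bill below are hypothetical, and the -setfacl form requires ACLs to be enabled in HDFS:

    # hypothetical path; grants read (and directory-traverse) access to all users
    hdfsuser@namenode$ hdfs dfs -chmod -R 755 /data/pxf_examples

    # hypothetical path and user; grants read access to the single user bill via an HDFS ACL
    hdfsuser@namenode$ hdfs dfs -setfacl -R -m user:bill:r-x /data/pxf_examples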

  7. If your Hadoop cluster is secured with Kerberos, you must configure PXF and generate Kerberos principals and keytabs for each segment host as described in Configuring PXF for Secure HDFS.

Updating Hadoop Configuration

If you update your Hadoop, Hive, or HBase configuration while the PXF service is running, you must copy the updated .xml file(s) to each Greenplum Database segment host and restart PXF.
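
For example, after updating hive-site.xml you might re-copy the file and restart PXF on every segment host. This sketch assumes PXF is installed under the default /usr/local/greenplum-db/pxf location and that seghostfile lists your segment hosts:

gpadmin@gpmaster$ gpscp -v -f seghostfile hive-site.xml =:/etc/hive/conf/hive-site.xml
gpadmin@gpmaster$ gpssh -f seghostfile -e '/usr/local/greenplum-db/pxf/bin/pxf restart'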