Installing and Configuring the Hadoop Client for PXF

You use the PXF HDFS connector to access HDFS file data. PXF requires a Hadoop client installation on each Greenplum Database segment host. The Hadoop client must be installed from a tarball.

This topic describes how to install and configure the Hadoop client for PXF access.

Prerequisites

Compatible Hadoop clients for PXF include Cloudera, Hortonworks Data Platform, and generic Apache Hadoop.

Before setting up the Hadoop client for PXF, ensure that you:

  • Have scp access to the HDFS NameNode host on a running Hadoop cluster (see the quick check after this list).
  • Have access to or permission to install Java version 1.7 or 1.8 on each segment host.
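
You can verify scp/ssh access to the NameNode host before you begin. For example, where hdfsuser and namenode identify a user and the NameNode host on your Hadoop cluster:

    $ ssh hdfsuser@namenode hostname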

Procedure

Perform the following procedure to install and configure the Hadoop client for PXF on each segment host in your Greenplum Database cluster. Where possible, you will use the gpssh and gpscp utilities to run commands on, and copy files to, multiple hosts in a single operation.

  1. Log in to your Greenplum Database master node and set up the environment:

    $ ssh gpadmin@<gpmaster>
    gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh
    
  2. Create a text file that lists your Greenplum Database segment hosts, one host name per line. Ensure that there are no blank lines or extra spaces in the file. For example, a file named seghostfile may include:

    seghost1
    seghost2
    seghost3
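
    You can confirm that gpssh reaches each host listed in the file before continuing. For example:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile hostname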
    
  3. If not already present, install Java on each Greenplum Database segment host. For example:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile sudo yum -y install java-1.8.0-openjdk-1.8.0*
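
    You can then confirm that Java installed successfully on every segment host. For example:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile "java -version"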
    
  4. Identify the Java base installation directory and add a $JAVA_HOME setting that references it to the gpadmin user’s .bash_profile file on each segment host. For example:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile "echo 'export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.144-0.b01.el7_4.x86_64/jre' >> /home/gpadmin/.bash_profile"
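
    To verify the setting, you can source the updated profile and echo the variable on each host; the escaped \$ defers expansion to the remote shell:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile ". ~/.bash_profile && echo \$JAVA_HOME"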
    
  5. Download a compatible Hadoop client and install it on each Greenplum Database segment host. The Hadoop client must be a tarball distribution. You must install the same Hadoop client distribution in the same file system location on each host.

    If you are running Cloudera Hadoop:

    1. Download the Hadoop distribution:

      gpadmin@gpmaster$ wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.10.2.tar.gz -O /tmp/hadoop-2.6.0-cdh5.10.2.tar.gz
      
    2. Copy the Cloudera Hadoop distribution to each Greenplum Database segment host. For example, to copy the distribution to the /home/gpadmin directory:

      gpadmin@gpmaster$ gpscp -v -f seghostfile /tmp/hadoop-2.6.0-cdh5.10.2.tar.gz =:/home/gpadmin
      
    3. Unpack the Cloudera Hadoop distribution on each Greenplum Database segment host. For example:

      gpadmin@gpmaster$ gpssh -e -v -f seghostfile "tar zxf /home/gpadmin/hadoop-2.6.0-cdh5.10.2.tar.gz"
      
    4. Ensure that the gpadmin user has read and execute permission on all Hadoop client libraries on each segment host. For example:

      gpadmin@gpmaster$ gpssh -e -v -f seghostfile "chmod -R 755 /home/gpadmin/hadoop-2.6.0-cdh5.10.2"
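
    As a quick check that the client unpacked correctly and is readable on every host, you can list a well-known file from the distribution. For example:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile "ls -l /home/gpadmin/hadoop-2.6.0-cdh5.10.2/bin/hdfs"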
      
  6. Locate the base install directory of the Hadoop client and add a $PXF_HADOOP_HOME setting that references it to the gpadmin user’s .bash_profile file on each segment host. For example:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile "echo 'export PXF_HADOOP_HOME=/home/gpadmin/hadoop-2.6.0-cdh5.10.2' >> /home/gpadmin/.bash_profile"
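
    You can verify this setting in the same way as $JAVA_HOME:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile ". ~/.bash_profile && echo \$PXF_HADOOP_HOME"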
    
  7. The fs.defaultFS property in the Hadoop core-site.xml configuration file identifies the HDFS NameNode URI. PXF requires this information to access the Hadoop cluster. A sample fs.defaultFS setting follows:

    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.domain:8020</value>
    </property>
    

    Complete the PXF Hadoop client configuration by copying configuration files from your Hadoop cluster to each Greenplum Database segment host.

    1. Copy the core-site.xml and hdfs-site.xml Hadoop configuration files from your Hadoop cluster NameNode host to the Greenplum Database master host. For example:

      gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
      gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
      
    2. Copy these Hadoop configuration files to each Greenplum Database segment host. For example:

      gpadmin@gpmaster$ gpscp -v -f seghostfile core-site.xml =:\$PXF_HADOOP_HOME/etc/hadoop/core-site.xml
      gpadmin@gpmaster$ gpscp -v -f seghostfile hdfs-site.xml =:\$PXF_HADOOP_HOME/etc/hadoop/hdfs-site.xml
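
    With the configuration files in place, you can optionally verify that the segment hosts can reach HDFS, using the hdfs utility shipped in the client tarball. This quick check assumes the segment hosts have network access to the Hadoop cluster:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile ". ~/.bash_profile && \$PXF_HADOOP_HOME/bin/hdfs dfs -ls /"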
      
  8. The PXF HDFS connector supports the Avro file format. If you plan to access files in the Avro format, you must download the required JAR file and copy it to each Greenplum Database segment host.

    1. Download the Avro JAR file required by PXF:

      gpadmin@gpmaster$ wget "http://central.maven.org/maven2/org/apache/avro/avro-mapred/1.7.1/avro-mapred-1.7.1.jar"
      
    2. Copy the Avro JAR file to each Greenplum Database segment host. You must copy the file to the $PXF_HADOOP_HOME/share/hadoop/common/lib directory. For example:

      gpadmin@gpmaster$ gpscp -v -f seghostfile avro-mapred-*.jar =:\$PXF_HADOOP_HOME/share/hadoop/common/lib
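
    You can confirm that the JAR landed in the expected directory on each host. For example:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile ". ~/.bash_profile && ls -l \$PXF_HADOOP_HOME/share/hadoop/common/lib/avro-mapred-*.jar"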
      

Note: If you update your Hadoop configuration while the PXF service is running, you must copy the updated core-site.xml and hdfs-site.xml files to each Greenplum Database segment host and restart PXF.
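
For example, to refresh the configuration files and restart PXF, you might run the following. This sketch assumes PXF is installed under /usr/local/greenplum-db/pxf on each segment host; adjust the path for your installation:

    gpadmin@gpmaster$ gpscp -v -f seghostfile core-site.xml =:\$PXF_HADOOP_HOME/etc/hadoop/core-site.xml
    gpadmin@gpmaster$ gpscp -v -f seghostfile hdfs-site.xml =:\$PXF_HADOOP_HOME/etc/hadoop/hdfs-site.xml
    gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf restart"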