One-time HDFS Protocol Installation

Install and configure Hadoop for use with gphdfs as follows:
  1. Install Java 1.6 or later on all Greenplum Database hosts: master, segment, and standby master.
  2. Install a supported Hadoop distribution on all hosts. The distribution must be the same on all hosts. For Hadoop installation information, see the Hadoop distribution documentation.
    Greenplum Database supports the following Hadoop distributions:
    Table 1. Hadoop Distributions
    Hadoop Distribution        Version                       gp_hadoop_target_version
    -------------------------  ----------------------------  ------------------------
    Pivotal HD [4]             Pivotal HD 3.0, 3.0.1         gphd-3.0
                               Pivotal HD 2.0, 2.1, 1.0 [1]  gphd-2.0
    Greenplum HD [4]           Greenplum HD 1.2              gphd-1.2
                               Greenplum HD 1.1              gphd-1.1 (default)
    Cloudera [4]               CDH 5.2, 5.3, 5.4.x - 5.8.x   cdh5
                               CDH 5.0, 5.1                  cdh4.1
                               CDH 4.1 [2] - CDH 4.7         cdh4.1
    Hortonworks Data Platform  HDP 2.1, 2.2, 2.3, 2.4, 2.5   hdp2
    MapR [3, 4]                MapR 4.x, MapR 5.x            gpmr-1.2
                               MapR 1.x, 2.x, 3.x            gpmr-1.0
    Apache Hadoop              2.x                           hadoop2
    Note:

    1. Pivotal HD 1.0 is a distribution of Hadoop 2.0.

    2. For CDH 4.1, only CDH4 with MRv1 is supported.

    3. MapR requires the MapR client software.

    4. Support for these Hadoop distributions has been deprecated and will be removed in a future release: Pivotal HD, Greenplum HD, Cloudera CDH 4.1 - CDH 4.7, and MapR 1.x, 2.x, 3.x.

    For the latest information regarding supported Hadoop distributions, see the Greenplum Database Release Notes for your release.
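
    As an illustration only, the mapping in Table 1 can be written as a small shell lookup. The function name hadoop_target_for and the short distribution keys (pivotalhd, greenplumhd, cdh, hdp, mapr, apache) are hypothetical and not part of Greenplum Database; the version patterns are a loose reading of the table, so check the result against the Release Notes for your release.

```shell
# Hypothetical helper (not shipped with Greenplum Database): print the
# gp_hadoop_target_version value that Table 1 lists for a distribution.
#   $1 = distribution key (assumed names, see the patterns below)
#   $2 = distribution version
hadoop_target_for() {
    case "$1:$2" in
        pivotalhd:3.0|pivotalhd:3.0.1)               echo gphd-3.0 ;;
        pivotalhd:2.0|pivotalhd:2.1|pivotalhd:1.0)   echo gphd-2.0 ;;
        greenplumhd:1.2)                             echo gphd-1.2 ;;
        greenplumhd:1.1)                             echo gphd-1.1 ;;
        cdh:5.[2-8]|cdh:5.[2-8].*)                   echo cdh5 ;;
        cdh:5.[01]|cdh:5.[01].*)                     echo cdh4.1 ;;
        cdh:4.[1-7]|cdh:4.[1-7].*)                   echo cdh4.1 ;;
        hdp:2.[1-5]|hdp:2.[1-5].*)                   echo hdp2 ;;
        mapr:[45].*)                                 echo gpmr-1.2 ;;
        mapr:[123].*)                                echo gpmr-1.0 ;;
        apache:2.*)                                  echo hadoop2 ;;
        *)
            echo "no Table 1 entry for: $1 $2" >&2
            return 1
            ;;
    esac
}

# Example call: hadoop_target_for cdh 5.4.2
```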
  3. After installation, ensure that the Greenplum system user (gpadmin) has read and execute access to the Hadoop libraries or to the Greenplum MR client.
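
    One way to spot-check this access is a small shell test run as the gpadmin user on each host. This is a minimal sketch: the function name can_read_and_execute is hypothetical, and it only tests the directory it is given, not every file beneath it.

```shell
# Hypothetical helper: succeed only if the current user can read and
# traverse (execute) the given directory. Run it as gpadmin.
#   $1 = directory to test, for example the Hadoop home directory
can_read_and_execute() {
    [ -d "$1" ] && [ -r "$1" ] && [ -x "$1" ]
}

# Example call, assuming HADOOP_HOME is already set:
#   can_read_and_execute "$HADOOP_HOME" || echo "gpadmin cannot access $HADOOP_HOME" >&2
```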
  4. Set the following environment variables on all segments:
    • JAVA_HOME – the Java home directory
    • HADOOP_HOME – the Hadoop home directory
    For example, add lines such as the following to the gpadmin user's .bashrc file:
    export JAVA_HOME=/usr/java/default
    export HADOOP_HOME=/usr/lib/gphd

    The variables must be set in the ~gpadmin/.bashrc or the ~gpadmin/.bash_profile file so that the gpadmin user shell environment can locate the Java home and Hadoop home.
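
    A quick sanity check of these two variables might look like the sketch below. The function names check_home_dir and check_hadoop_env are hypothetical; the check only confirms that each variable is set and names an existing directory in the shell where it runs.

```shell
# Hypothetical helper: report a problem if a home variable is empty or
# does not name an existing directory.
#   $1 = variable name (used in messages), $2 = its current value
check_home_dir() {
    if [ -z "$2" ]; then
        echo "$1 is not set" >&2
        return 1
    fi
    if [ ! -d "$2" ]; then
        echo "$1 points to a missing directory: $2" >&2
        return 1
    fi
}

# Check both variables named in this step; return non-zero if either fails.
check_hadoop_env() {
    rc=0
    check_home_dir JAVA_HOME "${JAVA_HOME:-}" || rc=1
    check_home_dir HADOOP_HOME "${HADOOP_HOME:-}" || rc=1
    return "$rc"
}

# Example call, run as gpadmin on a segment host:
#   check_hadoop_env && echo "Java and Hadoop homes look usable"
```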

  5. Set the following Greenplum Database server configuration parameters and reload the Greenplum Database configuration.
    Table 2. Server Configuration Parameters for Hadoop Targets

    gp_hadoop_target_version
        Description: The Hadoop target. Choose one of the following: gphd-1.0, gphd-1.1, gphd-1.2, gphd-2.0, gpmr-1.0, gpmr-1.2, hdp2, cdh4.1.
        Default value: gphd-1.1
        Set classifications: master, session, reload

    gp_hadoop_home
        Description: When using Pivotal HD, specify the installation directory for Hadoop. For example, the default installation directory is /usr/lib/gphd. When using Greenplum HD 1.2 or earlier, specify the same value as the HADOOP_HOME environment variable.
        Default value: NULL
        Set classifications: master, session, reload

    For example, the following commands use the Greenplum Database utilities gpconfig and gpstop to set the server configuration parameters and reload the configuration (gpstop -u reloads the configuration files without restarting the system):
    gpconfig -c gp_hadoop_target_version -v "'gphd-2.0'"
    gpconfig -c gp_hadoop_home -v "'/usr/lib/gphd'"
    gpstop -u
    For information about the Greenplum Database utilities gpconfig and gpstop, see the Greenplum Database Utility Guide.
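
    The commands above can be generated for a different target with a small dry-run helper, sketched here. The function name gphdfs_config_commands is hypothetical, and the accepted values are simply those that appear in Table 1 and Table 2; the helper only prints the commands so that they can be reviewed before being run as gpadmin.

```shell
# Hypothetical dry-run helper: validate the target value, then print the
# gpconfig and gpstop commands from this step without executing them.
#   $1 = gp_hadoop_target_version value, $2 = gp_hadoop_home directory
gphdfs_config_commands() {
    case "$1" in
        gphd-1.0|gphd-1.1|gphd-1.2|gphd-2.0|gphd-3.0) ;;
        gpmr-1.0|gpmr-1.2|hdp2|cdh4.1|cdh5|hadoop2) ;;
        *)
            echo "unknown gp_hadoop_target_version: $1" >&2
            return 1
            ;;
    esac
    # String values are wrapped in single quotes inside double quotes,
    # matching the quoting used in the example commands.
    printf '%s\n' \
        "gpconfig -c gp_hadoop_target_version -v \"'$1'\"" \
        "gpconfig -c gp_hadoop_home -v \"'$2'\"" \
        "gpstop -u"
}

# Example call: gphdfs_config_commands hdp2 /usr/lib/hadoop
# (the path is a placeholder). After the printed commands have been run,
# "gpconfig -s gp_hadoop_target_version" shows the stored value.
```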
  6. If needed, ensure that the CLASSPATH environment variable generated by the $GPHOME/lib/hadoop/hadoop_env.sh file on every Greenplum Database host includes the paths to the JAR files containing the Java classes that gphdfs requires.

    For example, if gphdfs returns a class not found exception, ensure that the JAR file containing the class is present on every Greenplum Database host, and update the $GPHOME/lib/hadoop/hadoop_env.sh file so that the CLASSPATH environment variable created by the file contains the JAR file.
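
    To confirm that a given JAR file appears in a CLASSPATH value, a check along the following lines may help. The function name classpath_contains is hypothetical. It matches whole entries exactly and does not expand wildcard entries such as a directory followed by /*, so a JAR picked up through a wildcard is not detected.

```shell
# Hypothetical helper: succeed if a colon-separated CLASSPATH value
# contains the given entry as an exact, whole element.
#   $1 = CLASSPATH value, $2 = path of the JAR file to look for
classpath_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example call. This assumes that sourcing hadoop_env.sh sets CLASSPATH in
# the current shell; the JAR path is a placeholder.
#   . "$GPHOME/lib/hadoop/hadoop_env.sh"
#   classpath_contains "$CLASSPATH" /opt/extra/example.jar || echo "JAR not on CLASSPATH" >&2
```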