Upgrading PXF When You Upgrade Greenplum 5

If you are using PXF in your current Greenplum Database 5.x installation, you must perform some PXF upgrade actions when you upgrade to a new version of Greenplum Database 5.x.

Note: Starting in Greenplum Database version 5.28.1, the PXF software is no longer provided in the Greenplum server distribution. You may be required to download and install the PXF rpm package to use the service in your Greenplum cluster as described in the procedures below.

This procedure describes how to upgrade PXF in your Greenplum Database installation. It uses PXF.from to refer to your currently installed PXF version and PXF.to to refer to the PXF version installed when you upgrade to the new version of Greenplum Database.

Most PXF installations do not require modifications to PXF configuration files and should experience a seamless upgrade.

Note: Starting in Greenplum Database version 5.12, PXF no longer requires a Hadoop client installation. PXF now bundles all of the JAR files on which it depends, and loads these JARs at runtime.

The PXF upgrade procedure has two parts: Step 1, which you perform before you upgrade to a new version of Greenplum Database, and Step 2, which you perform after the upgrade.

Step 1: PXF Pre-Upgrade Actions

Perform this procedure before you upgrade to a new version of Greenplum Database:

  1. Log in to the Greenplum Database master node. For example:

    $ ssh gpadmin@<gpmaster>
    
  2. Identify and note the PXF.from version number. For example:

    gpadmin@gpmaster$ pxf version
    
  3. Determine whether PXF.from is a PXF rpm installation or runs from the Greenplum Database server installation, and note the answer. You need this information in Step 2.

  4. Stop PXF on each segment host as described in Stopping PXF.

  5. If you are upgrading from Greenplum Database version 5.14 or earlier:

    1. Back up the PXF.from configuration files found in the $GPHOME/pxf/conf/ directory. These files should be the same on all segment hosts, so you need only copy from one of the hosts. For example:

      gpadmin@gpmaster$ mkdir -p /save/pxf-from-conf
      gpadmin@gpmaster$ scp gpadmin@seghost1:/usr/local/greenplum-db/pxf/conf/* /save/pxf-from-conf/
      
    2. Note the locations of any custom JAR files that you may have added to your PXF.from installation. Save a copy of these JAR files.

  6. Upgrade to the new version of Greenplum Database.

  7. Continue your PXF upgrade with Step 2: Upgrading PXF.
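
As a rough sketch of the backup in step 5, the function below copies the top-level files of a local PXF configuration directory into a save directory and checks each copy against its source. The function name backup_pxf_conf and its two arguments are hypothetical and not part of PXF. It assumes the directory is readable on the host where you run it, so for a remote segment host the scp form shown in step 5 still applies.

```shell
#!/usr/bin/env bash
# backup_pxf_conf SRC_DIR DEST_DIR
# Hypothetical helper, not part of PXF: copies the regular files found
# directly in SRC_DIR to DEST_DIR and verifies each copy byte for byte.
backup_pxf_conf() {
    local src_dir=$1 dest_dir=$2 f

    if [ ! -d "$src_dir" ]; then
        echo "backup_pxf_conf: no such directory: $src_dir" >&2
        return 1
    fi

    mkdir -p "$dest_dir" || return 1

    for f in "$src_dir"/*; do
        # Skip subdirectories and the unexpanded pattern of an empty directory.
        [ -f "$f" ] || continue
        # -p preserves file modes and timestamps.
        cp -p "$f" "$dest_dir"/ || return 1
        # Confirm that the copy matches the source file.
        cmp -s "$f" "$dest_dir/$(basename "$f")" || return 1
    done
    return 0
}
```

Called with $GPHOME/pxf/conf as the first argument and /save/pxf-from-conf as the second, this would mirror the example in step 5 on a host that holds the PXF.from files. Like the scp example, the pattern skips hidden files and subdirectories.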

Step 2: Upgrading PXF

After you upgrade to the new version of Greenplum Database, perform the following procedure to upgrade and configure the PXF.to software:

  1. Log in to the Greenplum Database master node. For example:

    $ ssh gpadmin@<gpmaster>
    
  2. Set up the environment for the new/upgraded Greenplum installation:

    gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh
    
  3. If PXF.from was a PXF rpm installation:

    1. Copy the PXF extension files from the PXF installation directory to the new Greenplum 5 install directory:

      gpadmin@gpmaster$ pxf cluster register
      
    2. Start PXF on each segment host as described in Starting PXF.

    3. Exit this procedure.

  4. If you are upgrading to Greenplum Database version 5.28.1 or later, PXF is no longer included in the Greenplum server distribution. You must download and install the separate PXF rpm package as described in Installing PXF.

  5. Initialize PXF on each segment host as described in Initializing PXF. You may choose to use your existing $PXF_CONF for the initialization.

  6. PXF user impersonation is on by default in Greenplum Database version 5.5.0 and later. If you are upgrading from an older Greenplum version, you must configure user impersonation for the underlying Hadoop services. Refer to Configuring User Impersonation and Proxying for instructions, including the configuration procedure to turn off PXF user impersonation.

  7. If you are upgrading from Greenplum Database version 5.14 or earlier:

    1. If you updated the pxf-env.sh configuration file in your PXF.from installation, re-apply those changes to $PXF_CONF/conf/pxf-env.sh. For example:

      gpadmin@gpmaster$ vi $PXF_CONF/conf/pxf-env.sh
         <update the file>
      
    2. Similarly, if you updated the pxf-profiles.xml configuration file in your PXF.from installation, re-apply those changes to $PXF_CONF/conf/pxf-profiles.xml on the master host.

      Note: Starting in Greenplum Database version 5.12, the package name for PXF classes was changed to use the prefix org.greenplum.*. If you are upgrading from an older Greenplum version and you customized the pxf-profiles.xml file, you must change any org.apache.hawq.pxf.* references to org.greenplum.pxf.* when you re-apply your changes.

    3. If you updated the pxf-log4j.properties configuration file in your PXF.from installation, re-apply those changes to $PXF_CONF/conf/pxf-log4j.properties on the master host.

    4. If you updated the pxf-public.classpath configuration file in your PXF.from installation, copy every JAR referenced in the file to $PXF_CONF/lib on the master host.

    5. If you added additional JAR files to your PXF.from installation, copy them to $PXF_CONF/lib on the master host.

    6. Starting in Greenplum Database version 5.15, PXF requires that the Hadoop configuration files reside in the $PXF_CONF/servers/default directory. If you configured PXF Hadoop connectors in your PXF.from installation, copy the Hadoop configuration files in /etc/<hadoop_service>/conf to $PXF_CONF/servers/default on the Greenplum Database master host.

    7. Starting in Greenplum Database version 5.15, the default Kerberos keytab file location for PXF is $PXF_CONF/keytabs. If you previously configured PXF for secure HDFS and the PXF keytab file is located in a PXF.from installation directory (for example, $GPHOME/pxf/conf), consider relocating the keytab file to $PXF_CONF/keytabs. Alternatively, update the PXF_KEYTAB property setting in the $PXF_CONF/conf/pxf-env.sh file to reference your keytab file.

  8. If you are upgrading from Greenplum Database version 5.18 or earlier:

    1. PXF now bundles Hadoop version 2.9.2 dependent JAR files. If you registered additional Hadoop-related JAR files in $PXF_CONF/lib, ensure that these libraries are compatible with Hadoop version 2.9.2.
    2. The PXF JDBC connector now supports file-based server configuration. If you choose to use this new feature with existing external tables that reference an external SQL database, refer to the configuration instructions in Configuring the JDBC Connector.
    3. The PXF JDBC connector now supports statement query timeouts. This feature requires explicit support from the JDBC driver. The default query timeout is 0, which means wait forever. Some JDBC drivers support a 0 timeout value without fully supporting the statement query timeout feature. Ensure that any JDBC driver that you have registered supports the default timeout or, better yet, fully supports this feature. You may need to update your JDBC driver version to obtain this support. Refer to the configuration instructions in JDBC Driver JAR Registration for information about registering JAR files with PXF.
    4. If you plan to use Hive partition filtering with integral types, you must set the hive.metastore.integral.jdo.pushdown Hive property in your Hadoop cluster’s hive-site.xml and in your PXF user configuration $PXF_CONF/servers/default/hive-site.xml file. See Updating Hadoop Configuration.

  9. If you are upgrading from Greenplum Database version 5.21.1 or earlier: The PXF Hive connector no longer supports providing a DELIMITER=<delim> option in the LOCATION URI when you create an external table that specifies the HiveText or HiveRC profiles. If you have previously created an external table that specified a DELIMITER in the LOCATION URI, you must drop the table, and recreate it omitting the DELIMITER from the LOCATION. You are still required to provide a non-default delimiter in the external table formatting options.

  10. If you are upgrading from Greenplum Database version 5.23 or earlier and you have configured any JDBC servers that access Kerberos-secured Hive, you must now set the hadoop.security.authentication property in the jdbc-site.xml file to explicitly identify use of the Kerberos authentication method. Perform the following for each of these server configurations:

    1. Navigate to the server configuration directory.
    2. Open the jdbc-site.xml file in the editor of your choice and uncomment or add the following property block to the file:

      <property>
          <name>hadoop.security.authentication</name>
          <value>kerberos</value>
      </property>
      
    3. Save the file and exit the editor.

  11. If you are upgrading from Greenplum Database version 5.26 or earlier: The PXF Hive and HiveRC profiles now support column projection using column name-based mapping. If you have any existing PXF external tables that specify one of these profiles, and the external table relied on column index-based mapping, you may be required to drop and recreate the tables:

    1. Identify all PXF external tables that you created that specify a Hive or HiveRC profile.
    2. For each external table that you identify in step 1, examine the definitions of both the PXF external table and the referenced Hive table. If the column names of the PXF external table do not match the column names of the Hive table:

      1. Drop the existing PXF external table. For example:

        DROP EXTERNAL TABLE pxf_hive_table1;
        
      2. Recreate the PXF external table using the Hive column names. For example:

        CREATE EXTERNAL TABLE pxf_hive_table1( hivecolname int, hivecolname2 text )
          LOCATION( 'pxf://default.hive_table_name?PROFILE=Hive')
        FORMAT 'custom' (FORMATTER='pxfwritable_import');
        
      3. Review any SQL scripts that you may have created that reference the PXF external table, and update column names if required.

  12. Synchronize the PXF configuration from the master host to the standby master and each Greenplum Database segment host. For example:

    gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
    
  13. Start PXF on each segment host as described in Starting PXF.
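
The package rename described in the note under step 7 can be applied with a stream edit rather than by hand. The sketch below is one possible approach and not a PXF utility: the function name rename_pxf_packages is hypothetical. It writes the result to a separate file so that you can review the changes before you re-apply them to $PXF_CONF/conf/pxf-profiles.xml.

```shell
#!/usr/bin/env bash
# rename_pxf_packages OLD_FILE NEW_FILE
# Hypothetical helper, not part of PXF: writes a copy of OLD_FILE to NEW_FILE
# with every org.apache.hawq.pxf. prefix replaced by org.greenplum.pxf.
rename_pxf_packages() {
    local old_file=$1 new_file=$2

    if [ ! -f "$old_file" ]; then
        echo "rename_pxf_packages: no such file: $old_file" >&2
        return 1
    fi

    # The dots are escaped so that they match literally; the g flag rewrites
    # every occurrence on a line, not only the first.
    sed 's/org\.apache\.hawq\.pxf\./org.greenplum.pxf./g' "$old_file" > "$new_file" || return 1

    # Fail if any legacy reference survived the rewrite.
    if grep -q 'org\.apache\.hawq\.pxf' "$new_file"; then
        echo "rename_pxf_packages: legacy references remain in $new_file" >&2
        return 1
    fi
    return 0
}
```

Because the original file is left untouched, you can compare the two files with diff before you copy the customized profiles into place.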