Configuring, Initializing, and Managing PXF

The Greenplum Platform Extension Framework (PXF) is composed of a Greenplum Database protocol and a Java service that map an external data source to a table definition. This topic describes how to configure, initialize, and manage PXF.

Installing PXF

PXF is installed on your master node when you install Greenplum Database, and on your Greenplum Database segment hosts when you run the gpseginstall command.

You must explicitly initialize and start PXF before you can use the framework. You must also explicitly enable PXF in each database in which you plan to use it.
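As a minimal sketch of the per-database step, the function below runs CREATE EXTENSION pxf in each database you name. The function name enable_pxf is hypothetical, and the sketch assumes psql is on your path and that you connect as a role permitted to create extensions.

```shell
# Enable the PXF extension in each named database.
# Usage: enable_pxf <database>...
# enable_pxf is an illustrative helper, not part of Greenplum Database.
enable_pxf() {
    local db
    for db in "$@"; do
        # CREATE EXTENSION fails if the extension already exists in the database,
        # so stop at the first error rather than continue silently.
        psql -d "$db" -c 'CREATE EXTENSION pxf;' || return 1
    done
}
```

For example, `enable_pxf sales hr` would enable PXF in two databases named sales and hr (both names are placeholders).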

PXF Install Files/Directories

The following PXF files and directories are installed in your Greenplum Database cluster. These files/directories are relative to $GPHOME:

Directory             Description
pxf                   The PXF installation directory.
pxf/apache-tomcat     The PXF Tomcat directory.
pxf/bin               The PXF script and executable directory.
pxf/conf              The PXF configuration directory. This directory contains the pxf-env.sh, pxf-public.classpath, pxf-private.classpath, and pxf-profiles.xml configuration files.
pxf/conf-templates    Configuration templates for PXF.
pxf/lib               The PXF library directory.
pxf/logs              The PXF log file directory. Includes pxf-service.log and Tomcat-related logs, including catalina.out. The log directory and log files are readable only by the gpadmin user.
pxf/pxf-service       After initializing PXF, the PXF service instance directory.
pxf/run               After starting PXF, the PXF run directory. Includes a PXF catalina process ID file.
pxf/tomcat-templates  Tomcat templates for PXF.
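One way to confirm the layout on a host is to check for each directory under $GPHOME. The sketch below is illustrative only: the function name pxf_missing_dirs is hypothetical, and it skips pxf/pxf-service and pxf/run because they appear only after you initialize and start PXF. It also skips pxf/logs, on the assumption that the log directory may not exist until the service first writes a log.

```shell
# Print each expected PXF directory that is missing under a Greenplum install root.
# Usage: pxf_missing_dirs <gphome>
# pxf_missing_dirs is an illustrative helper, not part of PXF.
pxf_missing_dirs() {
    local gphome="$1"
    local dir
    # Directories expected right after installation, relative to the install root.
    for dir in pxf pxf/apache-tomcat pxf/bin pxf/conf \
               pxf/conf-templates pxf/lib pxf/tomcat-templates; do
        [ -d "$gphome/$dir" ] || printf '%s\n' "$dir"
    done
}
```

Running `pxf_missing_dirs "$GPHOME"` prints nothing when every checked directory is present.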

Initializing PXF

You must explicitly initialize the PXF service instance. This one-time initialization creates the PXF service web application. It also updates PXF configuration files to include information specific to your Hadoop cluster configuration.

Prerequisites

Before initializing PXF in your Greenplum Database cluster, ensure that you have:

  • Installed and configured the required Hadoop clients on each Greenplum Database segment host. Refer to Installing and Configuring Hadoop Clients for PXF for instructions.
  • Granted read permission to the HDFS files and directories that will be accessed as external tables in Greenplum Database. If user impersonation is enabled (the default), you must grant this permission to each Greenplum Database user/role name that will use external tables that reference the HDFS files. If user impersonation is not enabled, you must grant this permission to the gpadmin user.

Procedure

Perform the following procedure to initialize PXF on each segment host in your Greenplum Database cluster. You will use the gpssh utility to run a command on multiple hosts.

  1. Log in to the Greenplum Database master node and set up your environment:

    $ ssh gpadmin@<gpmaster>
    gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh
    
  2. Create a text file that lists your Greenplum Database segment hosts, one host name per line. Ensure that there are no blank lines or extra spaces in the file. For example, a file named seghostfile may include:

    seghost1
    seghost2
    seghost3
    
  3. If not already present, install the unzip package on each Greenplum Database segment host:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile "sudo yum -y install unzip"
    
  4. Run the pxf init command to initialize the PXF service on each segment host. For example:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf init"
    

    The init command creates and initializes the PXF web application. It also updates the pxf-private.classpath file to include entries for your Hadoop distribution JAR files.
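Because the host file must contain no blank lines or extra spaces, it can help to check the file before passing it to gpssh. The following is a minimal sketch; the function name check_hostfile is hypothetical, and the check is deliberately strict in that it rejects any line containing whitespace.

```shell
# Return 0 if the host file lists one host name per line, with no blank lines
# and no spaces or tabs. Otherwise print the offending lines (prefixed with
# their line numbers) to stderr and return 1.
# check_hostfile is an illustrative helper, not part of Greenplum Database.
check_hostfile() {
    local file="$1"
    # -s is true only for a file that exists and is not empty.
    [ -s "$file" ] || { echo "host file is missing or empty: $file" >&2; return 1; }
    # Match empty lines, or lines that contain any whitespace character.
    if grep -nE '^$|[[:space:]]' "$file" >&2; then
        return 1
    fi
    return 0
}
```

You could then guard the initialization step, for example: `check_hostfile seghostfile && gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf init"`.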

Starting PXF

After initializing PXF, you must explicitly start PXF on each segment host in your Greenplum Database cluster. Once started, the PXF service runs as the gpadmin user and listens on port 5888 by default. Only the gpadmin user can start and stop the PXF service.

Perform the following procedure to start PXF on each segment host in your Greenplum Database cluster. You will use the gpssh command and a seghostfile to run the command on multiple hosts.

  1. Log in to the Greenplum Database master node and set up your environment:

    $ ssh gpadmin@<gpmaster>
    gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh
    
  2. Run the pxf start command to start PXF on each segment host. For example, if seghostfile lists the segment hosts in your Greenplum Database cluster, one host name per line:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf start"
    

Stopping PXF

When you need to stop PXF, for example before upgrading it, you must explicitly stop PXF on each segment host in your Greenplum Database cluster. Only the gpadmin user can stop the PXF service.

Perform the following procedure to stop PXF on each segment host in your Greenplum Database cluster. You will use the gpssh command and a seghostfile to run the command on multiple hosts.

  1. Log in to the Greenplum Database master node and set up your environment:

    $ ssh gpadmin@<gpmaster>
    gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh
    
  2. Run the pxf stop command to stop PXF on each segment host. For example, if seghostfile lists the segment hosts in your Greenplum Database cluster, one host name per line:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf stop"
    

PXF Service Management

The pxf command supports the init, start, stop, restart, and status operations. Each operation runs locally, on the host where you invoke the command. To manage the PXF service on a single segment host, log in to that host and run the command. To manage the PXF service on multiple segment hosts, use the gpssh utility as shown above, or log in to each segment host individually and run the command.
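If you run these operations often, a small wrapper can reject a mistyped operation before anything is sent to the segment hosts. This is a sketch under stated assumptions: the function name pxf_on_hosts is hypothetical, and it assumes Greenplum Database is installed at the same $GPHOME path on the master and on every segment host, falling back to the /usr/local/greenplum-db path used in the examples in this topic.

```shell
# Run one pxf operation on every host listed in a host file.
# Usage: pxf_on_hosts <hostfile> <operation>
# pxf_on_hosts is an illustrative helper, not part of PXF.
pxf_on_hosts() {
    local hostfile="$1" op="$2"
    # Assumption: the install path is identical on all hosts.
    local pxf_bin="${GPHOME:-/usr/local/greenplum-db}/pxf/bin/pxf"
    case "$op" in
        init|start|stop|restart|status) ;;
        *) echo "unsupported pxf operation: $op" >&2; return 2 ;;
    esac
    gpssh -e -v -f "$hostfile" "$pxf_bin $op"
}
```

For example, `pxf_on_hosts seghostfile status` reports the state of the PXF service on each host in seghostfile.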

Note: If you update your Hadoop or Hive configuration while the PXF service is running, you must copy any updated configuration files to each Greenplum Database segment host and restart PXF on each host.
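The copy-and-restart step in the note can be sketched with gpscp, which copies files to every host in a host file, followed by a pxf restart through gpssh. The function name sync_conf_and_restart is hypothetical. The remote directory is an assumption that you supply, because the location of the Hadoop and Hive client configuration depends on your distribution, and the gpadmin user must be able to write to it.

```shell
# Copy updated Hadoop or Hive client configuration files to every segment host,
# then restart PXF on those hosts.
# Usage: sync_conf_and_restart <hostfile> <remote_dir> <file>...
# sync_conf_and_restart is an illustrative helper, not part of PXF.
sync_conf_and_restart() {
    local hostfile="$1" remote_dir="$2"
    shift 2
    [ "$#" -gt 0 ] || { echo "no configuration files given" >&2; return 2; }
    # In a gpscp target, "=" stands for each host name in the host file.
    gpscp -f "$hostfile" "$@" "=:$remote_dir" || return 1
    # Assumption: the install path is identical on all hosts.
    gpssh -e -v -f "$hostfile" "${GPHOME:-/usr/local/greenplum-db}/pxf/bin/pxf restart"
}
```

A call might look like `sync_conf_and_restart seghostfile /etc/hadoop/conf core-site.xml hdfs-site.xml`, where /etc/hadoop/conf is only an example path.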