Troubleshooting PXF


PXF Errors

The following describes some errors you may encounter while using PXF, along with the likely cause and remedy for each:

Error: Protocol "pxf" does not exist
Cause: The pxf extension was not registered.
Remedy: Create (enable) the PXF extension for the database as described in the PXF Enable Procedure.

Error: Invalid URI pxf://<path-to-data>: missing options section
Cause: The LOCATION URI does not include the profile or other required options.
Remedy: Provide the profile and required options in the URI.

Error: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://<namenode>:8020/<path-to-file>
Cause: The HDFS file you specified in <path-to-file> does not exist.
Remedy: Provide the path to an existing HDFS file.

Error: NoSuchObjectException(message:<schema>.<hivetable> table not found)
Cause: The Hive table you specified with <schema>.<hivetable> does not exist.
Remedy: Provide the name of an existing Hive table.

Error: Failed to connect to <segment-host> port 5888: Connection refused (libchurl.c:944) (<segment-id> slice<N> <segment-host>:40000 pid=<process-id>)
Cause: PXF is not running on <segment-host>.
Remedy: Restart PXF on <segment-host>.

Error: ERROR: failed to acquire resources on one or more segments
DETAIL: could not connect to server: Connection refused
    Is the server running on host "<segment-host>" and accepting
    TCP/IP connections on port 40000? (seg<N> <segment-host>:40000)
Cause: The Greenplum Database segment host <segment-host> is down.

Error: org.apache.hadoop.security.AccessControlException: Permission denied: user=, access=READ, inode="":::-rw-------
Cause: The Greenplum Database user that executed the PXF operation does not have permission to access the underlying Hadoop service (HDFS or Hive). See About PXF User Impersonation.
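
For the missing-protocol and connection-refused errors above, the remedies boil down to registering the extension and restarting PXF. The following is a minimal sketch that assumes the PXF extension files are already installed and that seghostfile lists the segment hosts in your cluster; the status command shows whether PXF is running before you restart it:

$ psql -d <dbname>
dbname=# CREATE EXTENSION pxf;

gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf status"
gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf restart"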

PXF Logging

Enabling more verbose logging may aid PXF troubleshooting efforts. PXF provides two categories of message logging: service-level and client-level.

Service-Level Logging

PXF utilizes log4j for service-level logging. PXF-service-related log messages are captured in a log file specified by PXF’s log4j properties file, $GPHOME/pxf/conf/pxf-log4j.properties. The default PXF logging configuration will write INFO and more severe level logs to $GPHOME/pxf/logs/pxf-service.log. You can configure the logging level and log file location.

PXF provides more detailed logging when the DEBUG level is enabled. To configure PXF DEBUG logging, uncomment the following line in pxf-log4j.properties:

#log4j.logger.org.greenplum.pxf=DEBUG
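
If you prefer to make the change non-interactively rather than in an editor, a sed one-liner such as the following accomplishes the same edit; this is a sketch that assumes the default file location and that the line is commented out exactly as shown above:

gpadmin@gpmaster$ sed -i 's/^#log4j.logger.org.greenplum.pxf/log4j.logger.org.greenplum.pxf/' $GPHOME/pxf/conf/pxf-log4j.properties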

Copy the updated pxf-log4j.properties file to each Greenplum Database segment host, and then restart the PXF service on each host. For example:

gpadmin@gpmaster$ gpscp -v -f seghostfile $GPHOME/pxf/conf/pxf-log4j.properties =:/usr/local/greenplum-db/pxf/conf/pxf-log4j.properties
gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf restart"

With DEBUG-level logging enabled, perform your PXF operations; for example, create and query an external table. (Make note of the time; it will help you locate the relevant log messages in $GPHOME/pxf/logs/pxf-service.log.)

$ date
Wed Oct  4 09:30:06 MDT 2017
$ psql -d <dbname>
dbname=> CREATE EXTERNAL TABLE hdfstest(id int, newid int)
    LOCATION ('pxf://data/dir/hdfsfile?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (delimiter=E',');
dbname=> SELECT * FROM hdfstest;
<select output>

Examine/collect the log messages from pxf-service.log.
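
One convenient way to collect those messages from all segment hosts at once is to tail the service log over gpssh; this sketch assumes the default log file location:

gpadmin@gpmaster$ gpssh -e -v -f seghostfile "tail -50 /usr/local/greenplum-db/pxf/logs/pxf-service.log"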

Note: DEBUG logging is quite verbose and has a performance impact. Remember to turn off PXF service DEBUG logging after you have collected the desired information.

Client-Level Logging

Database-level client logging may provide insight into internal PXF service operations.

Enable Greenplum Database and PXF debug message logging during operations on PXF external tables by setting the client_min_messages server configuration parameter to DEBUG2 in your psql session.

$ psql -d <dbname>
dbname=# SET client_min_messages=DEBUG2;
dbname=# SELECT * FROM hdfstest;
...
DEBUG2:  churl http header: cell #19: X-GP-URL-HOST: seghost1  (seg0 slice1 127.0.0.1:40000 pid=3981)
CONTEXT:  External table hdfstest, file pxf://data/dir/hdfsfile?PROFILE=HdfsTextSimple
DEBUG2:  churl http header: cell #20: X-GP-URL-PORT: 5888  (seg0 slice1 127.0.0.1:40000 pid=3981)
CONTEXT:  External table hdfstest, file pxf://data/dir/hdfsfile?PROFILE=HdfsTextSimple
DEBUG2:  churl http header: cell #21: X-GP-DATA-DIR: data/dir/hdfsfile  (seg0 slice1 127.0.0.1:40000 pid=3981)
CONTEXT:  External table hdfstest, file pxf://data/dir/hdfsfile?PROFILE=HdfsTextSimple
DEBUG2:  churl http header: cell #22: X-GP-OPTIONS-PROFILE: HdfsTextSimple  (seg0 slice1 127.0.0.1:40000 pid=3981)
CONTEXT:  External table hdfstest, file pxf://data/dir/hdfsfile?PROFILE=HdfsTextSimple
DEBUG2:  churl http header: cell #23: X-GP-URI: pxf://data/dir/hdfsfile?PROFILE=HdfsTextSimple  (seg0 slice1 127.0.0.1:40000 pid=3981)
CONTEXT:  External table hdfstest, file pxf://data/dir/hdfsfile?PROFILE=HdfsTextSimple
...

Examine/collect the log messages from stdout.

Note: DEBUG2 database session logging has a performance impact. Remember to turn off DEBUG2 logging after you have collected the desired information.

dbname=# SET client_min_messages=NOTICE;

Addressing PXF Memory Issues

Because a single PXF agent (JVM) serves multiple segments on a segment host, the PXF heap size can be a limiting runtime factor. Memory pressure is more evident under concurrent workloads and queries against large files: a query may hang or fail due to insufficient memory, or Java garbage collection may noticeably degrade response times. To avert or remedy these situations, first try increasing the Java maximum heap size or decreasing the Tomcat maximum number of threads, depending upon what works best for your system configuration.
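
Before changing anything, it can help to see how much memory the PXF JVM currently uses on each segment host and which heap settings are in effect. The following is a sketch that assumes the PXF Tomcat process command line contains the string pxf-service; the exact match may differ in your installation:

gpadmin@gpmaster$ gpssh -e -v -f seghostfile "ps -eo rss,args | grep [p]xf-service"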

Note: The configuration changes described in this topic require modifying config files on each segment host in your Greenplum Database cluster. After you perform the updates, verify that the configuration is identical on all segment hosts.

You will need to re-apply these configuration changes after any PXF version upgrades.

Increasing the JVM Memory for PXF

Each PXF agent running on a segment host is configured with a default maximum Java heap size of 2GB and an initial heap size of 1GB. If the segment hosts in your Greenplum Database cluster have ample memory, try increasing the maximum heap size to a value between 3GB and 4GB. Set the initial and maximum heap size to the same value if possible.

Perform the following procedure to increase the heap size for the PXF agent running on each segment host in your Greenplum Database cluster.

  1. Log in to your Greenplum Database master node and set up the environment:

    $ ssh gpadmin@<gpmaster>
    gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh
    
  2. Edit the $GPHOME/pxf/conf/pxf-env.sh file. For example:

    gpadmin@gpmaster$ vi $GPHOME/pxf/conf/pxf-env.sh
    
  3. Locate the PXF_JVM_OPTS setting in the pxf-env.sh file, and update the -Xmx and/or -Xms options to the desired value. For example:

    PXF_JVM_OPTS="-Xmx3g -Xms3g"
    
  4. Copy the updated pxf-env.sh file to each Greenplum Database segment host. For example, if seghostfile contains a list, one-host-per-line, of the segment hosts in your Greenplum Database cluster:

    gpadmin@gpmaster$ gpscp -v -f seghostfile $GPHOME/pxf/conf/pxf-env.sh =:/usr/local/greenplum-db/pxf/conf/pxf-env.sh
    
  5. Restart PXF on each Greenplum Database segment host. For example:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf restart"
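
  6. Optionally, verify that every segment host picked up the new heap settings; this sketch simply greps the copied pxf-env.sh file on each host:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile "grep PXF_JVM_OPTS /usr/local/greenplum-db/pxf/conf/pxf-env.sh"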
    

Another Option for Resource-Constrained PXF Segment Hosts

If increasing the maximum heap size is not suitable for your Greenplum Database deployment, try decreasing the number of concurrent working threads configured for PXF’s underlying Tomcat web application. A decrease in the number of running threads will prevent any PXF node from exhausting its memory, while ensuring that current queries run to completion (albeit a bit slower). Tomcat’s default behavior is to queue requests until a thread is free, or the queue is exhausted.

The Tomcat default maximum number of threads is 300. Decrease the maximum number of threads to a value that works optimally for your Greenplum Database deployment. (If you plan to run large workloads on a large number of files in an external Hive data store, specify an even lower value.)

Perform the following procedure to decrease the maximum number of Tomcat threads for the PXF agent running on each segment host in your Greenplum Database deployment.

  1. Log in to your Greenplum Database master node and set up the environment:

    $ ssh gpadmin@<gpmaster>
    gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh
    
  2. Edit the $GPHOME/pxf/pxf-service/conf/server.xml file. For example:

    gpadmin@gpmaster$ vi $GPHOME/pxf/pxf-service/conf/server.xml
    
  3. Locate the Catalina Executor block and update the maxThreads setting to the desired value. For example:

    <Executor maxThreads="100"
              minSpareThreads="50"
              name="tomcatThreadPool"
              namePrefix="tomcat-http--"/>
    
  4. Copy the updated server.xml file to each Greenplum Database segment host. For example, if seghostfile contains a list, one-host-per-line, of the segment hosts in your Greenplum Database cluster:

    gpadmin@gpmaster$ gpscp -v -f seghostfile $GPHOME/pxf/pxf-service/conf/server.xml =:/usr/local/greenplum-db/pxf/pxf-service/conf/server.xml
    
  5. Restart PXF on each Greenplum Database segment host. For example:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf restart"
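
  6. Optionally, verify that the maxThreads change is in place on every segment host; this sketch simply greps the copied server.xml file on each host:

    gpadmin@gpmaster$ gpssh -e -v -f seghostfile "grep maxThreads /usr/local/greenplum-db/pxf/pxf-service/conf/server.xml"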