Troubleshooting PXF
PXF Errors
The following table describes some errors you may encounter while using PXF:
Error Message | Discussion |
---|---|
Protocol “pxf” does not exist | Cause: The pxf extension was not registered. Solution: Create (enable) the PXF extension for the database as described in the PXF Enable Procedure. |
Invalid URI pxf://<path-to-data>: missing options section | Cause: The LOCATION URI does not include the profile or other required options. Solution: Provide the profile and required options in the URI. |
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://<namenode>:8020/<path-to-file> | Cause: The HDFS file that you specified in <path-to-file> does not exist. Solution: Provide the path to an existing HDFS file. |
NoSuchObjectException(message:<schema>.<hivetable> table not found) | Cause: The Hive table that you specified with <schema>.<hivetable> does not exist. Solution: Provide the name of an existing Hive table. |
Failed to connect to <segment-host> port 5888: Connection refused (libchurl.c:944) (<segment-id> slice<N> <segment-host>:40000 pid=<process-id>) … | Cause: PXF is not running on <segment-host>. Solution: Restart PXF on <segment-host>. |
ERROR: failed to acquire resources on one or more segments DETAIL: could not connect to server: Connection refused Is the server running on host “<segment-host>” and accepting TCP/IP connections on port 40000? (seg<N> <segment-host>:40000) | Cause: The Greenplum Database segment host <segment-host> is down. |
org.apache.hadoop.security.AccessControlException: Permission denied: user= | Cause: The Greenplum Database user that executed the PXF operation does not have permission to access the underlying Hadoop service (HDFS or Hive). See Configuring User Impersonation and Proxying. |
PXF Logging
Enabling more verbose logging may aid PXF troubleshooting efforts. PXF provides two categories of message logging: service-level and client-level.
Service-Level Logging
PXF utilizes log4j for service-level logging. PXF-service-related log messages are captured in a log file specified by PXF’s log4j properties file, `$PXF_CONF/conf/pxf-log4j.properties`. The default PXF logging configuration writes INFO and more severe level logs to `$PXF_CONF/logs/pxf-service.log`. You can configure the logging level and the log file location.
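For example, the following sketch shows one way to inspect the current logging configuration and the resulting log file from the master host; the grep pattern is illustrative and depends on the property names defined in your pxf-log4j.properties file:

```
# Illustrative check of the configured log level/location and the current service log.
gpadmin@gpmaster$ grep -E 'rootLogger|\.File=' $PXF_CONF/conf/pxf-log4j.properties
gpadmin@gpmaster$ ls -lh $PXF_CONF/logs/pxf-service.log
```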
PXF provides more detailed logging when the DEBUG level is enabled. To configure PXF DEBUG logging and examine the output:
1. Log in to your Greenplum Database master node:

    ```
    $ ssh gpadmin@<gpmaster>
    ```

2. Open `$PXF_CONF/conf/pxf-log4j.properties` in an editor, uncomment the following line, save the file, and exit the editor:

    ```
    #log4j.logger.org.greenplum.pxf=DEBUG
    ```

3. Use the `pxf cluster sync` command to copy the updated `pxf-log4j.properties` file to the Greenplum Database cluster. For example:

    ```
    gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
    ```

4. Restart PXF on each Greenplum Database segment host as described in Restarting PXF.
5. With DEBUG level logging now enabled, you can perform your PXF operations. Be sure to make note of the time; this will direct you to the relevant log messages in `$PXF_CONF/logs/pxf-service.log`.

    ```
    $ date
    Wed Oct 4 09:30:06 MDT 2017
    $ psql -d <dbname>
    ```

6. Create and query an external table. For example:

    ```
    dbname=> CREATE EXTERNAL TABLE hdfstest(id int, newid int) LOCATION ('pxf://data/dir/hdfsfile?PROFILE=hdfs:text') FORMAT 'TEXT' (delimiter=E',');
    dbname=> SELECT * FROM hdfstest;
    <select output>
    ```
7. Finally, examine/collect the log messages from `pxf-service.log`.
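For example, assuming you noted the time as shown above, a sketch of one way to collect the relevant messages from the service log on a segment host; the output file name and grep patterns are illustrative and depend on the timestamp format in your log:

```
# Pull DEBUG messages logged around the time you ran the query.
gpadmin@seghost$ grep DEBUG $PXF_CONF/logs/pxf-service.log | grep '09:30' > /tmp/pxf-debug-snippet.log
```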
Note: DEBUG logging is quite verbose and has a performance impact. Remember to turn off PXF service DEBUG logging after you have collected the desired information.
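Turning DEBUG logging back off follows the same pattern as enabling it; a minimal sketch, where the sed invocation simply re-comments the line you uncommented earlier:

```
# Re-comment the DEBUG line, then sync the file to the cluster.
gpadmin@gpmaster$ sed -i 's/^log4j.logger.org.greenplum.pxf=DEBUG/#&/' $PXF_CONF/conf/pxf-log4j.properties
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
```

Then restart PXF on each Greenplum Database segment host as described in Restarting PXF.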
Client-Level Logging
Database-level client logging may provide insight into internal PXF service operations. Enable Greenplum Database and PXF debug message logging during operations on PXF external tables by setting the `client_min_messages` server configuration parameter to `DEBUG2` in your `psql` session.
```
$ psql -d <dbname>
dbname=# SET client_min_messages=DEBUG2;
dbname=# SELECT * FROM hdfstest;
...
DEBUG2: churl http header: cell #19: X-GP-URL-HOST: seghost1 (seg0 slice1 127.0.0.1:40000 pid=3981)
CONTEXT: External table hdfstest
DEBUG2: churl http header: cell #20: X-GP-URL-PORT: 5888 (seg0 slice1 127.0.0.1:40000 pid=3981)
CONTEXT: External table hdfstest
DEBUG2: churl http header: cell #21: X-GP-DATA-DIR: data/dir/hdfsfile (seg0 slice1 127.0.0.1:40000 pid=3981)
CONTEXT: External table hdfstest
DEBUG2: churl http header: cell #22: X-GP-OPTIONS-PROFILE: hdfs:text (seg0 slice1 127.0.0.1:40000 pid=3981)
CONTEXT: External table hdfstest
...
```
Examine/collect the log messages from stdout.
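If you need to save the messages for later review, one way is to run the query non-interactively and redirect both stdout and stderr to a file; a sketch, where the output file name is illustrative:

```
# Capture the DEBUG2 session output for later review.
$ psql -d <dbname> -c "SET client_min_messages=DEBUG2; SELECT * FROM hdfstest;" > /tmp/pxf-client-debug.out 2>&1
```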
Note: DEBUG2 database session logging has a performance impact. Remember to turn off DEBUG2 logging after you have collected the desired information.

```
dbname=# SET client_min_messages=NOTICE;
```
Addressing PXF Memory Issues
Because a single PXF agent (JVM) serves multiple segments on a segment host, the PXF heap size can be a limiting runtime factor. This will be more evident under concurrent workloads and/or queries against large files. You may run into situations where a query will hang or fail due to insufficient memory or the Java garbage collector impacting response times. To avert or remedy these situations, first try increasing the Java maximum heap size or decreasing the Tomcat maximum number of threads, depending upon what works best for your system configuration. You may also choose to configure PXF to perform specific actions when it detects an out of memory condition.
Note: The configuration changes described in this topic require modifying config files on each node in your Greenplum Database cluster. After you perform the updates on the master, be sure to synchronize the PXF configuration to the Greenplum Database cluster.
Configuring Out of Memory Condition Actions
In an out of memory (OOM) situation, PXF returns the following error in response to a query:
```
java.lang.OutOfMemoryError: Java heap space
```
You can configure the PXF JVM to enable/disable the following actions when it detects an OOM condition:
- Auto-kill the PXF server (enabled by default).
- Dump the Java heap (disabled by default).
Auto-Killing the PXF Server
By default, PXF is configured such that when the PXF JVM detects an out of memory condition on a segment host, it automatically runs a script that kills the PXF server running on the host. The `PXF_OOM_KILL` configuration property governs this auto-kill behavior.
When auto-kill is enabled and the PXF JVM detects an OOM condition and kills the PXF server on the segment host:

- PXF logs the following messages to `$PXF_CONF/logs/catalina.out` on the segment host:

    ```
    =====> <date> PXF Out of memory detected <======
    =====> <date> PXF shutdown scheduled <======
    ```

- Any query that you run on a PXF external table will fail with the following error until you restart the PXF server on the segment host:

    ```
    ... Failed to connect to <host> port 5888: Connection refused
    ```
When the PXF server on a segment host is shut down in this manner, you must explicitly restart the PXF server on the host. See the pxf reference page for more information on the `pxf start` command.
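For example, to start the PXF server again on the affected segment host, you might run the following on that host:

```
# Start the PXF server on this segment host after an OOM auto-kill.
gpadmin@seghost$ $GPHOME/pxf/bin/pxf start
```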
Refer to the configuration procedure below for the instructions to disable/enable this PXF configuration property.
Dumping the Java Heap
In an out of memory situation, it may be useful to capture the Java heap dump to help determine what factors contributed to the resource exhaustion. You can use the `PXF_OOM_DUMP_PATH` property to configure PXF to write the heap dump to a file when it detects an OOM condition. By default, PXF does not dump the Java heap on OOM.
If you choose to enable the heap dump on OOM, you must set `PXF_OOM_DUMP_PATH` to the absolute path to a file or directory:

- If you specify a directory, the PXF JVM writes the heap dump to the file `<directory>/java_pid<pid>.hprof`, where `<pid>` identifies the process ID of the PXF server instance. The PXF JVM writes a new file to the directory every time the JVM goes OOM.
- If you specify a file and the file does not exist, the PXF JVM writes the heap dump to the file when it detects an OOM. If the file already exists, the JVM will not dump the heap.
Ensure that the `gpadmin` user has write access to the dump file or directory.
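A quick way to verify write access before relying on the setting; a sketch assuming a hypothetical dump directory named /home/gpadmin/pxf_dumps on each segment host:

```
# Confirm the gpadmin user can create and write files in the dump directory.
gpadmin@seghost$ mkdir -p /home/gpadmin/pxf_dumps
gpadmin@seghost$ touch /home/gpadmin/pxf_dumps/.write_test && rm /home/gpadmin/pxf_dumps/.write_test
```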
Note: Heap dump files are often rather large. If you enable heap dump on OOM for PXF and specify a directory for `PXF_OOM_DUMP_PATH`, multiple OOMs will generate multiple files in the directory and could potentially consume a large amount of disk space. If you specify a file for `PXF_OOM_DUMP_PATH`, disk usage is constant when the file name does not change. You must rename the dump file or configure a different `PXF_OOM_DUMP_PATH` to generate subsequent heap dumps.
Refer to the configuration procedure below for the instructions to enable/disable this PXF configuration property.
Procedure
Auto-kill of the PXF server on OOM is enabled by default. Heap dump generation on OOM is disabled by default. To configure one or both of these properties, perform the following procedure:
1. Log in to your Greenplum Database master node:

    ```
    $ ssh gpadmin@<gpmaster>
    ```

2. Edit the `$PXF_CONF/conf/pxf-env.sh` file. For example:

    ```
    gpadmin@gpmaster$ vi $PXF_CONF/conf/pxf-env.sh
    ```

3. If you want to configure (i.e. turn off, or turn back on) auto-kill of the PXF server on OOM, locate the `PXF_OOM_KILL` property in the `pxf-env.sh` file. If the setting is commented out, uncomment it, and then update the value. For example, to turn off this behavior, set the value to `false`:

    ```
    export PXF_OOM_KILL=false
    ```
4. If you want to configure (i.e. turn on, or turn back off) automatic heap dumping when the PXF server hits an OOM condition, locate the `PXF_OOM_DUMP_PATH` setting in the `pxf-env.sh` file.

    To turn this behavior on, set the `PXF_OOM_DUMP_PATH` property value to the file system location to which you want the PXF JVM to dump the Java heap. For example, to dump to a file named `/home/gpadmin/pxfoom_segh1`:

    ```
    export PXF_OOM_DUMP_PATH=/home/gpadmin/pxfoom_segh1
    ```

    To turn off heap dumping after you have turned it on, comment out the `PXF_OOM_DUMP_PATH` property setting:

    ```
    #export PXF_OOM_DUMP_PATH=/home/gpadmin/pxfoom_segh1
    ```
5. Save the `pxf-env.sh` file and exit the editor.

6. Use the `pxf cluster sync` command to copy the updated `pxf-env.sh` file to the Greenplum Database cluster. For example:

    ```
    gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
    ```
7. Restart PXF on each Greenplum Database segment host as described in Restarting PXF.
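After this procedure, the relevant portion of pxf-env.sh might look like the following; the values shown are illustrative and the dump directory is hypothetical:

```
# Keep auto-kill of the PXF server enabled and write a heap dump on OOM.
export PXF_OOM_KILL=true
export PXF_OOM_DUMP_PATH=/home/gpadmin/pxf_dumps
```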
Increasing the JVM Memory for PXF
Each PXF agent running on a segment host is configured with a default maximum Java heap size of 2GB and an initial heap size of 1GB. If the segment hosts in your Greenplum Database cluster have an ample amount of memory, try increasing the maximum heap size to a value between 3-4GB. Set the initial and maximum heap size to the same value if possible.
Perform the following procedure to increase the heap size for the PXF agent running on each segment host in your Greenplum Database cluster.
1. Log in to your Greenplum Database master node:

    ```
    $ ssh gpadmin@<gpmaster>
    ```

2. Edit the `$PXF_CONF/conf/pxf-env.sh` file. For example:

    ```
    gpadmin@gpmaster$ vi $PXF_CONF/conf/pxf-env.sh
    ```

3. Locate the `PXF_JVM_OPTS` setting in the `pxf-env.sh` file, and update the `-Xmx` and/or `-Xms` options to the desired value. For example:

    ```
    PXF_JVM_OPTS="-Xmx3g -Xms3g"
    ```
4. Save the file and exit the editor.

5. Use the `pxf cluster sync` command to copy the updated `pxf-env.sh` file to the Greenplum Database cluster. For example:

    ```
    gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
    ```
6. Restart PXF on each Greenplum Database segment host as described in Restarting PXF.
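One way to confirm that the new heap settings took effect after the restart is to inspect the running PXF JVM's arguments on a segment host; a sketch, where the process match pattern is illustrative and may need adjusting for your installation:

```
# Show the -Xms/-Xmx options of the running PXF JVM.
gpadmin@seghost$ ps -ef | grep '[p]xf' | grep -oE '(-Xms|-Xmx)[0-9]+[gGmM]?'
```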
Another Option for Resource-Constrained PXF Segment Hosts
If increasing the maximum heap size is not suitable for your Greenplum Database deployment, try decreasing the number of concurrent working threads configured for PXF’s underlying Tomcat web application. A decrease in the number of running threads will prevent any PXF node from exhausting its memory, while ensuring that current queries run to completion (albeit a bit slower). Tomcat’s default behavior is to queue requests until a thread is free, or the queue is exhausted.
The default maximum number of Tomcat threads for PXF is 200. The `PXF_MAX_THREADS` configuration property controls this setting.
PXF thread capacity is determined by the profile and whether or not the data is compressed. If you plan to run large workloads on a large number of files in an external Hive data store, or you are reading compressed ORC or Parquet data, consider specifying a lower `PXF_MAX_THREADS` value.
Note: Keep in mind that a higher thread count also correlates with higher memory consumption when all threads are in use.
Perform the following procedure to set the maximum number of Tomcat threads for the PXF agent running on each segment host in your Greenplum Database deployment.
1. Log in to your Greenplum Database master node:

    ```
    $ ssh gpadmin@<gpmaster>
    ```

2. Edit the `$PXF_CONF/conf/pxf-env.sh` file. For example:

    ```
    gpadmin@gpmaster$ vi $PXF_CONF/conf/pxf-env.sh
    ```

3. Locate the `PXF_MAX_THREADS` setting in the `pxf-env.sh` file. Uncomment the setting and update it to the desired value. For example, to set the maximum number of Tomcat threads to 100:

    ```
    export PXF_MAX_THREADS=100
    ```
4. Save the file and exit the editor.
5. Use the `pxf cluster sync` command to copy the updated `pxf-env.sh` file to the Greenplum Database cluster. For example:

    ```
    gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
    ```
6. Restart PXF on each Greenplum Database segment host as described in Restarting PXF.
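If you want a rough view of how many threads the PXF JVM is actually using on a segment host under load, a sketch; the pgrep pattern is illustrative and may match additional processes on your hosts:

```
# Print the live thread count of the first process whose command line matches "pxf".
gpadmin@seghost$ ps -o nlwp= -p $(pgrep -f pxf | head -1)
```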
Addressing PXF JDBC Connector Time Zone Errors
You use the PXF JDBC connector to access data stored in an external SQL database. Depending upon the JDBC driver, the driver may return an error if there is a mismatch between the default time zone set for the PXF server and the time zone set for the external SQL database.
For example, if you use the PXF JDBC connector to access an Oracle database with a conflicting time zone, PXF logs an error similar to the following:
```
SEVERE: Servlet.service() for servlet [PXF REST Service] in context with path [/pxf] threw exception
java.io.IOException: ORA-00604: error occurred at recursive SQL level 1
ORA-01882: timezone region not found
```
Should you encounter this error, you can set default time zone option(s) for the PXF server in the `PXF_JVM_OPTS` property setting in the `$PXF_CONF/conf/pxf-env.sh` configuration file. For example, to set the time zone:

```
export PXF_JVM_OPTS="<current_settings> -Duser.timezone=America/Chicago"
```
You can use the `PXF_JVM_OPTS` property to set other Java options as well.
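For example, a pxf-env.sh line that combines existing heap settings with the time zone option might look like this; the heap values shown are illustrative:

```
# Illustrative combined setting: heap sizes plus the JVM default time zone.
export PXF_JVM_OPTS="-Xmx2g -Xms1g -Duser.timezone=America/Chicago"
```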
As described in previous sections, you must synchronize the updated PXF configuration to the Greenplum Database cluster and restart the PXF server on each segment host.
PXF Fragment Metadata Caching
A PXF connector Fragmenter uses metadata from the external data source to split data into a list of fragments (blocks, files, etc.) that can be read in parallel. PXF caches the fragment metadata on a per-query basis: the first thread to access a fragment’s metadata stores the information in a cache, and other threads reuse this cached metadata. Caching of this nature reduces query memory requirements for external data sources with a large number of fragments.
PXF fragment metadata caching is enabled by default. To turn off fragment metadata caching, or to re-enable it after turning it off, perform the following procedure:
1. Log in to your Greenplum Database master node:

    ```
    $ ssh gpadmin@<gpmaster>
    ```

2. Edit the `$PXF_CONF/conf/pxf-env.sh` file. For example:

    ```
    gpadmin@gpmaster$ vi $PXF_CONF/conf/pxf-env.sh
    ```

3. Locate the `PXF_FRAGMENTER_CACHE` setting in the `pxf-env.sh` file. If the setting is commented out, uncomment it, and then update the value. For example, to turn off fragment metadata caching, set the value to `false`:

    ```
    export PXF_FRAGMENTER_CACHE=false
    ```
4. Save the file and exit the editor.
5. Use the `pxf cluster sync` command to copy the updated `pxf-env.sh` file to the Greenplum Database cluster. For example:

    ```
    gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
    ```
6. Restart PXF on each Greenplum Database segment host as described in Restarting PXF.