Using PXF to Read and Write External Data

A newer version of this documentation is available. Use the version menu above to view the most up-to-date release of the Greenplum 5.x documentation.

The Greenplum Platform Extension Framework (PXF) implements a protocol named pxf that you can use to create an external table that references data in an external data store. The PXF protocol and Java service are packaged as a Greenplum Database extension.

You must enable the PXF extension in each database in which you plan to use the framework to access external data. You must also explicitly GRANT permission to the pxf protocol to those users/roles who require access.

After the extension is registered and privileges are assigned, you can use the CREATE EXTERNAL TABLE command to create an external table using the pxf protocol. PXF provides built-in HDFS, Hive, HBase, and JDBC connectors. These connectors define profiles that support varied external data sources and file formats. You specify the profile name in the CREATE EXTERNAL TABLE command LOCATION URI.

Enabling/Disabling PXF

You must explicitly enable the PXF extension in each Greenplum Database in which you plan to use the extension.

Note: You must have Greenplum Database administrator privileges to create an extension.

Enable Procedure

Perform the following procedure for each database in which you want to use PXF:

  1. Connect to the database as the gpadmin user:

    gpadmin@gpmaster$ psql -d <database-name> -U gpadmin
    
  2. Create the PXF extension. You must have Greenplum Database administrator privileges to create an extension. For example:

    database-name=# CREATE EXTENSION pxf;
    

    Creating the pxf extension registers the pxf protocol and the call handlers required for PXF to access external data.

Disable Procedure

When you no longer want to use PXF on a specific database, you must explicitly disable the PXF extension for that database:

  1. Connect to the database as the gpadmin user:

    gpadmin@gpmaster$ psql -d <database-name> -U gpadmin
    
  2. Drop the PXF extension:

    database-name=# DROP EXTENSION pxf;
    

    The DROP command fails if there are any currently defined external tables using the pxf protocol. Add the CASCADE option if you choose to forcibly remove these external tables.

Granting Access to PXF

To read external data with PXF, you create an external table with the CREATE EXTERNAL TABLE command that specifies the pxf protocol. You must specifically grant SELECT permission to the pxf protocol to all non-SUPERUSER Greenplum Database roles that require such access.

To grant a specific role access to the pxf protocol, use the GRANT command. For example, to grant the role named bill read access to data referenced by an external table created with the pxf protocol:

GRANT SELECT ON PROTOCOL pxf TO bill;

To write data to an external data store with PXF, you create an external table with the CREATE WRITABLE EXTERNAL TABLE command that specifies the pxf protocol. You must specifically grant INSERT permission to the pxf protocol to all non-SUPERUSER Greenplum Database roles that require such access. For example:

GRANT INSERT ON PROTOCOL pxf TO bill;

Configuring Filter Pushdown

PXF supports filter pushdown. When filter pushdown is enabled, the constraints from the WHERE clause of a SELECT query can be extracted and passed to the external data source for filtering. This process can improve query performance, and can also reduce the amount of data that is transferred to Greenplum Database.

You enable or disable filter pushdown for all external table protocols, including pxf, by setting the gp_external_enable_filter_pushdown server configuration parameter. The default value of this configuration parameter is on; set it to off to disable filter pushdown. For example:

SHOW gp_external_enable_filter_pushdown;
SET gp_external_enable_filter_pushdown TO 'on';

Note: Some external data sources do not support filter pushdown. Also, filter pushdown may not be supported with certain data types or operators. If a query accesses a data source that does not support filter push-down for the query constraints, the query is instead executed without filter pushdown (the data is filtered after it is transferred to Greenplum Database).

PXF accesses data sources using different connectors, and filter pushdown support is determined by the specific connector implementation. The following PXF connectors support filter pushdown:

  • Hive Connector
  • HBase Connector
  • JDBC Connector

PXF filter pushdown can be used with these data types:

  • INT
  • FLOAT
  • NUMERIC
  • BOOL
  • CHAR, TEXT
  • DATE, TIMESTAMP (JDBC connector only)

PXF filter pushdown can be used with these operators:

  • <, <=, >=, >
  • <>, =
  • IN operator on arrays of INT and TEXT
  • LIKE (only for TEXT fields) (JDBC connector only)

To summarize, all of the following criteria must be met for filter pushdown to occur:

  • You enable external table filter pushdown by setting the gp_external_enable_filter_pushdown server configuration parameter to 'on'.
  • The Greenplum Database protocol that you use to access external data source must support filter pushdown. The pxf external table protocol supports pushdown.
  • The external data source that you are accessing must support pushdown. For example, HBase and Hive support pushdown.
  • For queries on external tables that you create with the pxf protocol, the underlying PXF connector must also support filter pushdown. For example, the PXF Hive, HBase, and JDBC connectors support pushdown.

PXF Profiles

PXF is installed with HDFS, Hive, HBase, and JDBC connectors that provide a number of built-in profiles. These profiles simplify and unify access to external data sources of varied formats. You provide the profile name when you specify the pxf protocol on a CREATE EXTERNAL TABLE command to create a Greenplum Database external table referencing an external data store.

PXF provides the following built-in profiles. Those profiles that support write access are identified as such in their Description.

Data Source Data Format Profile Name(s) Description
HBase Many HBase Any data type that can be converted to an array of bytes.
HDFS Text HdfsTextSimple Delimited single line records from plain text files on HDFS. Supports write.
HDFS Text HdfsTextMulti Delimited single or multi-line records with quoted linefeeds from plain text files on HDFS.
HDFS Avro Avro Avro format binary files (<filename>.avro).
HDFS JSON Json JSON format data files (<filename>.json).
HDFS Binary SequenceWritable Binary files. Supports write.
HDFS Parquet Parquet Parquet compressed columnar data stored in an HDFS file. Supports read.
Hive TextFile Hive, HiveText Data in comma-, tab-, or space-separated value format or JSON notation.
Hive RCFile Hive, HiveRC Record columnar data consisting of binary key/value pairs.
Hive SequenceFile Hive Data consisting of binary key/value pairs.
Hive ORC Hive, HiveORC, HiveVectorizedORC Optimized row columnar data with stripe, footer, and postscript sections.
Hive Parquet Hive Compressed columnar data representation.
SQL Database Many Jdbc Read from and write to external SQL database.

A PXF profile definition includes the name of the profile, a description, and the Java classes that implement parsing and reading external data for the profile. Built-in PXF profiles are defined in the $GPHOME/pxf/conf/pxf-profiles-default.xml configuration file. The built-in HdfsTextSimple profile definition is reproduced below:

<profile>
    <name>HdfsTextSimple</name>
    <description>This profile is suitable for using when reading
        delimited single line records from plain text files on HDFS
    </description>
    <plugins>
        <fragmenter>org.greenplum.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
        <accessor>org.greenplum.pxf.plugins.hdfs.LineBreakAccessor</accessor>
        <resolver>org.greenplum.pxf.plugins.hdfs.StringPassResolver</resolver>
    </plugins>
</profile>

Note: Profile <plugins> identify the Java classes that PXF uses to parse and access the external data. The typical PXF user need not concern themselves with the profile plugins.

PXF JAR Dependencies

You use PXF to access data stored on external systems. Depending upon the external data store, this access may require that you install and/or configure additional components or services for the external data store. For example, to use PXF to access a file stored in HDFS, you must install a Hadoop client on each Greenplum Database segment host.

PXF depends on JAR files and other configuration information provided by these additional components. The $GPHOME/pxf/conf/pxf-private.classpath and $GPHOME/pxf/conf/pxf-public.classpath configuration files identify PXF JAR dependencies. In most cases, PXF manages the pxf-private.classpath file, adding entries as necessary based on your Hadoop distribution and optional Hive and HBase client installations.

Should you need to add additional JAR dependencies for PXF, for example a JDBC driver JAR file, you must add them to the pxf-public.classpath file on each segment host, and then restart PXF on each host.

Creating an External Table using PXF

The syntax for a CREATE EXTERNAL TABLE command that specifies the pxf protocol is as follows:

CREATE [WRITABLE] EXTERNAL TABLE <table_name>
        ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION('pxf://<path-to-data>?PROFILE[&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);

The LOCATION clause in a CREATE EXTERNAL TABLE statement specifying the pxf protocol is a URI that identifies the path to, or other information describing, the location of the external data. For example, if the external data store is HDFS, the <path-to-data> would identify the absolute path to a specific HDFS file. If the external data store is Hive, <path-to-data> would identify a schema-qualified Hive table name.

Use the query portion of the URI, introduced by the question mark (?), to identify the PXF profile name.

You will provide profile-specific information using the optional &<custom-option>=<value> component of the LOCATION string and formatting information via the <formatting-properties> component of the string. The custom options and formatting properties supported by a specific profile are identified later in usage documentation.

Greenplum Database passes the parameters in the LOCATION string as headers to the PXF Java service.

Table 1. Create External Table Parameter Values and Descriptions

Keyword Value and Description
<path‑to‑data> A directory, file name, wildcard pattern, table name, etc. The syntax of <path-to-data> is dependent upon the profile currently in use.
PROFILE The profile PXF uses to access the data. PXF supports HBase, HDFS, Hive, and JDBC connectors that expose profiles named HBase, HdfsTextSimple, HdfsTextMulti, Avro, Json, SequenceWritable, Hive, HiveText, HiveRC, HiveORC, HiveVectorizedORC, and Jdbc.
<custom‑option>=<value> Additional options and values supported by the profile. 
FORMAT <value> PXF profiles support the ’TEXT’, ’CSV’, and ’CUSTOMFORMATs.
<formatting‑properties> Formatting properties supported by the profile; for example, the formatter or delimiter.  

Note: When you create a PXF external table, you cannot use the HEADER option in your FORMAT specification.