Tanzu Greenplum Platform Extension Framework 6.x Release Notes

The Tanzu Greenplum Platform Extension Framework (PXF) is included in the Tanzu Greenplum Database Server distribution in Greenplum version 6.18.x and older, and in version 5.28.0 and older. PXF for Red Hat/CentOS and Oracle Enterprise Linux is updated and distributed independently of Greenplum Database starting with PXF version 5.13.0. PXF version 5.16.0 is the first independent release that includes an Ubuntu distribution.

You may need to download and install the PXF package to obtain the most recent version of this component.

Supported Platforms

The independent PXF 6.x distribution is compatible with these operating system platform and Greenplum Database versions:

OS Version | Greenplum Version
RHEL 7.x, CentOS 7.x | 5.21.2+, 6.x
OEL 7.x, Ubuntu 18.04 LTS | 6.x

PXF is compatible with these Java and Hadoop component versions:

PXF Version | Java Versions | Hadoop Versions | Hive Server Versions | HBase Server Version
6.2.x, 6.1.0, 6.0.x | 8, 11 | 2.x, 3.1+ | 1.x, 2.x, 3.1+ | 1.3.2
5.16.x, 5.15.x, 5.14, 5.13 | 8, 11 | 2.x, 3.1+ | 1.x, 2.x, 3.1+ | 1.3.2

Upgrading to PXF 6.2.x

If you are currently using PXF with Greenplum Database, you may be required to perform upgrade actions for this release. Review Upgrading from PXF 5 or Upgrading from an Earlier PXF 6 Release to plan your upgrade to PXF version 6.2.x.

Release 6.2.3

Release Date: February 2, 2022

Changed Features

PXF 6.2.3 includes these changes:

  • PXF bundles version 2.17.1 of the log4j2 library to mitigate CVE-2021-44832.
  • PXF updates the version of Go that it uses to build the pxf CLI tool to version 1.17.6 to mitigate CVE-2021-44716.
  • PXF now writes early startup messages to the file $PXF_LOG_DIR/pxf_app.out. These messages were previously directed to stdout/stderr and ignored.
  • PXF introduces a performance improvement when it iterates over a list of fragments.

Resolved Issues

PXF 6.2.3 resolves these issues:

Issue # | Summary
CVE-2021-44832 | Updates the bundled log4j2 library to version 2.17.1. (Resolved by PR-735.)
CVE-2021-44716 | Updates the version of Go used to build the pxf CLI tool to version 1.17.6. (Resolved by PR-740.)

Release 6.2.2

Release Date: December 22, 2021

Changed Features

PXF 6.2.2 includes these changes:

  • PXF bundles version 2.17.0 of the log4j2 library to mitigate CVE-2021-45105.
  • PXF downgrades the bundled version of Spring Boot to resolve issue 31927.

Resolved Issues

PXF 6.2.2 resolves these issues:

Issue # | Summary
CVE-2021-45105 | Updates the bundled log4j2 library to version 2.17.0. (Resolved by PR-733.)
31927 | Resolves an issue where the PXF C extension reported a partial file transfer error when a data-less response that the PXF server sent to Greenplum Database failed to include a zero-length chunk. PXF 6.2.2 downgrades the bundled version of Spring Boot to 2.4.3, which does not exhibit the error behavior. (Resolved by PR-732.)

Release 6.2.1

Release Date: December 17, 2021

Changed Features

PXF 6.2.1 includes these changes:

  • PXF bundles version 2.16.0 of the log4j2 library to mitigate CVE-2021-44228 and CVE-2021-45046.
  • PXF now returns an UnsupportedOperationException when it accesses a Hive transactional table.
  • PXF now supports the SKIP_HEADER_COUNT option for external tables that specify a *:text:multi profile.
  • When reading from a MySQL database, PXF now uses a jdbc.statement.fetchSize default value of -2147483648 (Integer.MIN_VALUE). This setting enables the MySQL JDBC driver to stream the results from a MySQL server, lessening the memory requirements when reading large data sets.
  • The PXF Hive connector now uses the hive-site.xml hive.metastore.failure.retries property setting to identify the maximum number of times to retry a failed connection to the Hive MetaStore. The default value is one retry. Addressing Hive MetaStore Connection Errors describes when and how to configure this property.
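The MetaStore retry behavior described in the last item can be pictured with a minimal Python sketch. It is an illustration only and not the PXF implementation (PXF is written in Java). The function name connect_with_retries, the delay_seconds parameter, and the use of ConnectionError are hypothetical, and the sketch assumes that a setting of N means one initial attempt plus up to N further attempts.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retries(connect: Callable[[], T],
                         retries: int = 1,
                         delay_seconds: float = 0.0) -> T:
    """Call connect(); on ConnectionError, retry up to `retries` more times.

    `retries` stands in for the hive.metastore.failure.retries setting.
    The default of 1 mirrors the "one retry" default noted above. How PXF
    counts attempts internally is an assumption of this sketch.
    """
    attempts = max(retries, 0) + 1  # the initial attempt plus the retries
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError as err:
            last_error = err
            # Pause between attempts, but not after the final one.
            if attempt < attempts - 1 and delay_seconds > 0:
                time.sleep(delay_seconds)
    raise last_error
```

With the default of one retry, a connection function that fails once and then succeeds returns normally on the second call. One that always fails raises after two calls.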

Resolved Issues

PXF 6.2.1 resolves these issues:

Issue # | Summary
CVE-2021-45046 | Updates the bundled log4j2 library to version 2.16.0. (Resolved by PR-727.)
CVE-2021-44228 | Updates the bundled log4j2 library to version 2.15.0. (Resolved by PR-723.)
31955 | Resolves an issue where PXF failed to access a Hive table due to a MetaStore connection issue. PXF now includes retry logic for the MetaStore connection based on the hive.metastore.failure.retries property setting in the hive-site.xml file. (Resolved by PR-726.)
31948 | Resolves an issue where PXF ran out of memory when it read a large data set from a MySQL database. PXF now uses a jdbc.statement.fetchSize default value of -2147483648 (Integer.MIN_VALUE) when it accesses MySQL, which streams the results from a MySQL server to PXF. (Resolved by PR-721.)
31906 | Resolves an issue where PXF returned 0 rows when a query was performed on a Hive transactional table instead of reporting that transactional tables are unsupported. PXF now more clearly identifies the problem by returning an UnsupportedOperationException and the error: PXF does not support Hive transactional tables. (Resolved by PR-719.)
31791 | Resolves an issue where PXF ignored the SKIP_HEADER_COUNT custom option when it read from an external data source via an external table that specified a *:text:multi profile. PXF now recognizes and implements this option for *:text:multi profiles. (Resolved by PR-710.)
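The MySQL fetch size default noted for issue 31948 is the smallest 32-bit signed integer. The Python sketch below checks that arithmetic and shows one way a per-database default could be chosen. The default_fetch_size helper, the detection of MySQL by URL prefix, and the caller-supplied fallback are assumptions made for illustration; the notes do not describe how PXF makes this decision.

```python
# Java's Integer.MIN_VALUE is -(2**31), which equals -2147483648.
JAVA_INTEGER_MIN_VALUE = -(2 ** 31)


def default_fetch_size(jdbc_url: str, fallback: int) -> int:
    """Return a default fetch size for the given JDBC URL.

    Hypothetical helper: URLs that start with "jdbc:mysql:" get
    Integer.MIN_VALUE, the value the notes say makes the MySQL JDBC
    driver stream results. Any other URL gets the caller's fallback.
    """
    if jdbc_url.lower().startswith("jdbc:mysql:"):
        return JAVA_INTEGER_MIN_VALUE
    return fallback
```

A user-specified jdbc.statement.fetchSize would still take precedence over any default; the sketch covers only the case where no value is set.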

Release 6.2.0

Release Date: September 13, 2021

New and Changed Features

PXF 6.2.0 includes these new and changed features:

  • PXF adds support for reading a JSON array into a Greenplum Database text array (TEXT[]). Refer to Working with JSON Data for additional information.
  • PXF adds support for reading lists of certain ORC scalar types into a Greenplum Database array of native type. Refer to the PXF ORC data type mapping documentation for more information about the data type mapping.
  • PXF bundles newer versions of ORC, Spring Boot, and other dependent libraries.
  • PXF improves its message logging by:

    • Better aligning the log message text.
    • Also logging the affected fragment when it encounters a read error.
  • PXF introduces a new property to the pxf-site.xml per-server configuration file. PXF uses this property, pxf.sasl.connection.retries, to specify the maximum number of times that it retries a SASL connection request to an external data source after a refused connection returns a GSS initiate failed error.

  • PXF introduces a new PXF Service application property, pxf.fragmenter-cache.expiration, to specify the amount of time after which an entry expires and is removed from the fragment cache.
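The effect of an expiration setting such as pxf.fragmenter-cache.expiration can be sketched as a small time-based cache in Python. This is not the PXF fragment cache: the ExpiringCache class, its method names, and the choice to measure age from the time an entry was written are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class ExpiringCache:
    """Cache whose entries expire a fixed time after they are stored."""

    def __init__(self, expiration_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._expiration = expiration_seconds
        self._clock = clock  # injectable so that expiry can be tested
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def put(self, key: Hashable, value: Any) -> None:
        # Record the value together with the time it was stored.
        self._entries[key] = (self._clock(), value)

    def get(self, key: Hashable) -> Any:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._expiration:
            # The entry has expired: remove it and report a miss.
            del self._entries[key]
            return None
        return value
```

A shorter expiration releases memory sooner but forces fragment metadata to be recomputed more often. A longer one does the opposite.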

Resolved Issues

PXF 6.2.0 resolves these issues:

Issue # | Summary
(none) | Resolves an issue when using the jdbc profile to write data to a Hive table. The Hive JDBC driver always returned 0 when executing an update, and PXF would return an error even if the INSERT executed correctly. (Resolved by PR-662.)
31675 | Resolves a fragment cache issue that appeared when an external table was re-created within the same transaction in a stored procedure, and the new external table referenced a different LOCATION. (Resolved by PR-691.)
31657 | Resolves an issue where queries on an external table intermittently failed in some Kerberos-secured environments because the Hadoop NameNode erroneously detected a replay attack during Kerberos authentication. (Resolved by PR-688.)
31571 | PXF did not support ORC lists. PXF 6.2.0 includes support for reading lists of certain ORC scalar types into a Greenplum Database array of native type. (Resolved by PR-675.)
31326 | PXF did not support reading a JSON array into a Greenplum Database array-type column. PXF 6.2.0 includes support for reading a JSON array into a text array (TEXT[]). (Resolved by PR-646.)
683 | Resolves an issue where PXF incorrectly cast an enum value from the external data source to a string. (Resolved by PR-696.)

Release 6.1.0

Release Date: June 24, 2021

New and Changed Features

PXF 6.1.0 includes these new and changed features:

  • PXF now natively supports reading and writing Avro arrays.
  • PXF adds support for reading JSON objects, such as embedded arrays, as text. The data returned by PXF is a valid JSON string that you can manipulate with the existing Greenplum Database JSON functions and operators.
  • PXF improves its error reporting by displaying the exception class when there is no error message available.
  • PXF introduces a new property that you can use to configure the connection timeout for data upload/write operations to an external datastore. This property is named pxf.connection.upload-timeout, and is located in the pxf-application.properties file.
  • PXF now uses the pxf.connection.timeout configuration property to set the connection timeout only for read operations. If you previously set this property to specify the write timeout, you should now use pxf.connection.upload-timeout instead.
  • PXF bundles a newer gp-common-go-libs supporting library along with its dependencies.
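The split between the read and write timeout properties can be illustrated with a short Python sketch that parses properties-style text and picks the key that applies to an operation. The parse_properties and timeout_for helpers are hypothetical, the parser handles only simple key=value lines, and any values supplied to it are placeholders rather than documented defaults.

```python
from typing import Dict, Optional


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and comments.

    A deliberately small parser for illustration. It does not handle
    line continuations or escapes that a real .properties reader would.
    """
    props: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def timeout_for(props: Dict[str, str], operation: str) -> Optional[str]:
    """Return the timeout setting that applies to 'read' or 'write'."""
    if operation == "write":
        # Writes (uploads) use the property introduced in 6.1.0.
        return props.get("pxf.connection.upload-timeout")
    # Reads continue to use the original property.
    return props.get("pxf.connection.timeout")
```

If a site set only pxf.connection.timeout before upgrading, timeout_for(props, "write") returns None in this sketch. That corresponds to the case where the write timeout must now be set separately.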

Resolved Issues

PXF 6.1.0 resolves these issues:

Issue # | Summary
31389 | Resolves an issue where certain pxf cluster commands returned the error connect: no such file or directory when the current working directory contained a directory with the same name as the hostname. This issue was resolved by upgrading a dependent library. (Resolved by PR-633.)
31317 | PXF did not support writing Avro arrays. PXF 6.1.0 includes native support for reading and writing Avro arrays. (Resolved by PR-636.)

Release 6.0.1

Release Date: May 11, 2021

Resolved Issues

PXF 6.0.1 resolves these issues:

Issue # | Summary
(none) | Resolves an issue where PXF returned wrong results for batches of ORC data that were shorter than the default batch size. (Resolved by PR-630.)
(none) | Resolves an issue where PXF threw a NullPointerException when it encountered a repeating ORC column value of type string. (Resolved by PR-627.)
178013439 | Resolves an issue where using the profile HiveVectorizedORC did not result in vectorized execution. (Resolved by PR-624.)
31409 | Resolves an issue where PXF intermittently failed with the error ERROR: PXF server error(500) : Failed to initialize HiveResolver when it accessed Hive tables STORED AS ORC. (Resolved by PR-626.)

Release 6.0.0

Release Date: March 29, 2021

New and Changed Features

PXF 6.0.0 includes these new and changed features:

Architecture and Bundled Libraries

  • PXF 6.0.0 is built on the Spring Boot framework:

    • PXF distributes a single JAR file that includes all of its dependencies.
    • PXF no longer installs and uses a standalone Tomcat server; it uses the Tomcat version 9.0.43 embedded in the PXF Spring Boot application.
  • PXF bundles the postgresql-42.2.14.jar PostgreSQL driver JAR file.

  • PXF library dependencies have changed with new, updated, and removed libraries.

  • The PXF API has changed. If you are upgrading from PXF 5.x, you must update the PXF extension in each database in which it is registered as described in Upgrading from PXF 5.

  • PXF 6 moves fragment allocation from its C extension to the PXF Service running on each segment host.

  • The PXF Service now also runs on the Greenplum Database master and standby master hosts. If you used PXF 5.x to access Kerberos-secured HDFS, you must now generate principals and keytabs for the master and standby master as described in Upgrading from PXF 5.

Files, Configuration, and Commands

  • PXF 6 uses the $PXF_BASE environment variable to identify its runtime configuration directory; it no longer uses $PXF_CONF for this purpose.
  • By default, PXF installs its executables and runtime configuration into the same directory, $PXF_HOME, and PXF_BASE=$PXF_HOME. See About the PXF Installation and Configuration Directories for the new installation file layout.
  • You can relocate the $PXF_BASE runtime configuration directory to a different directory after you install PXF by running the new pxf [cluster] prepare command as described in Relocating $PXF_BASE.
  • PXF template server configuration files now reside in $PXF_HOME/templates; they were previously located in the $PXF_CONF/templates directory.
  • The pxf [cluster] register command now copies only the PXF pxf.control extension file to the Greenplum Database installation. Run this command after your first installation of PXF, and/or after you upgrade your Greenplum Database installation.
  • PXF 6 no longer requires initialization, and deprecates the init and reset commands. pxf [cluster] init is now equivalent to pxf [cluster] register, and pxf [cluster] reset is a no-op.
  • PXF 6 includes new and changed configuration; see About the PXF Configuration Files for more information:

    • PXF 6 integrates with Apache Log4j 2; the PXF logging configuration file is now named pxf-log4j2.xml, and is in XML format.
    • PXF 6 adds a new configuration file for the PXF server application, pxf-application.properties; this file includes:

      • New properties to configure the PXF streaming thread pool.
      • New pxf.log.level property to set the PXF logging level.
      • Configuration properties moved from the PXF 5 pxf-env.sh file and renamed:

        pxf-env.sh Property Name | pxf-application.properties Property Name
        PXF_MAX_THREADS | pxf.max.threads
    • PXF 6 adds new configuration environment variables to pxf-env.sh to simplify the registration of external library dependencies:

      New Property Name | Description
      PXF_LOADER_PATH | Additional directories and JARs for PXF to class-load.
      LD_LIBRARY_PATH | Additional directories and native libraries for PXF to load.


      See Registering PXF Library Dependencies for more information.

    • PXF 6 deprecates the PXF_FRAGMENTER_CACHE configuration property; fragment metadata caching is no longer configurable and is now always enabled.
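The relationship between $PXF_HOME and $PXF_BASE described in this list can be sketched in Python as a lookup that falls back to $PXF_HOME when $PXF_BASE is not set. The resolve_pxf_base and templates_dir helpers are hypothetical and are not part of the pxf CLI. They only restate the default (PXF_BASE=$PXF_HOME) and the new template location ($PXF_HOME/templates).

```python
import os
from typing import Mapping, Optional


def resolve_pxf_base(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the runtime configuration directory.

    Uses PXF_BASE when it is set and non-empty. Otherwise falls back to
    PXF_HOME, matching the default layout described above. Raises
    KeyError if neither variable is available.
    """
    env = os.environ if env is None else env
    base = env.get("PXF_BASE")
    if base:
        return base
    return env["PXF_HOME"]


def templates_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the template directory, which lives under PXF_HOME in PXF 6."""
    env = os.environ if env is None else env
    return env["PXF_HOME"] + "/templates"
```

After $PXF_BASE is relocated, resolve_pxf_base returns the new directory while templates_dir continues to point under $PXF_HOME.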

Profiles

  • PXF 6 introduces new profile names and deprecates some older profile names. The old profile names still work, but switching to the new profile names is strongly recommended:

    New Profile Name | Old/Deprecated Profile Name
    hive | Hive
    hive:rc | HiveRC
    hive:orc | HiveORC
    hive:orc | HiveVectorizedORC [1]
    hive:text | HiveText
    jdbc | Jdbc
    hbase | HBase

    [1] To use the HiveVectorizedORC profile in PXF 6, specify the hive:orc profile name with the new VECTORIZE=true custom option.

  • PXF adds support for natively reading an ORC file located in Hadoop, an object store, or a network file system. See the Hadoop ORC and Object Store ORC documentation for prerequisites and usage information.

  • PXF adds support for reading and writing comma-separated value (CSV) form text data located in Hadoop, an object store, or a network file system through a separate CSV profile. See the Hadoop Text and Object Store Text documentation for usage information.

  • PXF supports predicate pushdown on VARCHAR data types.

  • PXF supports predicate pushdown for the IN operator when you specify one of the *:parquet profiles to read a parquet file.

  • PXF supports specifying a codec short name (alias) rather than the Java class name when you create a writable external table for a *:text, *:csv, or *:SequenceFile profile that includes a COMPRESSION_CODEC.
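The profile renames listed in this section lend themselves to a lookup table. The Python sketch below maps each deprecated name to its replacement and adds VECTORIZE=true for HiveVectorizedORC, as the profile table's footnote describes. The migrate_profile and location_options helpers are hypothetical, and the ampersand-separated option string assumes the usual layout of options in a LOCATION URL.

```python
from typing import Dict, Tuple

# Deprecated profile name -> new profile name, as listed in these notes.
NEW_PROFILE_NAMES: Dict[str, str] = {
    "Hive": "hive",
    "HiveRC": "hive:rc",
    "HiveORC": "hive:orc",
    "HiveVectorizedORC": "hive:orc",
    "HiveText": "hive:text",
    "Jdbc": "jdbc",
    "HBase": "hbase",
}


def migrate_profile(old_name: str) -> Tuple[str, Dict[str, str]]:
    """Return the new profile name and any extra custom options.

    Names that are not in the table are returned unchanged.
    """
    new_name = NEW_PROFILE_NAMES.get(old_name, old_name)
    # HiveVectorizedORC maps to hive:orc plus the VECTORIZE option.
    options = {"VECTORIZE": "true"} if old_name == "HiveVectorizedORC" else {}
    return new_name, options


def location_options(old_name: str) -> str:
    """Build the option part of a LOCATION URL for the migrated profile."""
    new_name, options = migrate_profile(old_name)
    parts = ["PROFILE=" + new_name]
    parts += [key + "=" + value for key, value in options.items()]
    return "&".join(parts)
```

For example, location_options("HiveVectorizedORC") yields PROFILE=hive:orc&VECTORIZE=true, while a name that is already in the new style, such as hive:text, passes through unchanged.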

Monitoring

Logging

  • PXF improves the display of error messages in the psql client, in some cases including a HINT that provides possible error resolution actions.
  • When PXF is configured to auto-terminate on detection of an out of memory condition, it now logs messages to $PXF_LOGDIR/pxf-oom.log rather than catalina.out.

Removed Features

PXF version 6.0.0 removes:

  • The THREAD-SAFE external table custom option (deprecated since 5.10.0).
  • The PXF_USER_IMPERSONATION, PXF_PRINCIPAL, and PXF_KEYTAB configuration properties in pxf-env.sh (deprecated since 5.10.0).
  • The jdbc.user.impersonation configuration property in jdbc-site.xml (deprecated since 5.10.0).
  • The Hadoop profile names HdfsTextSimple, HdfsTextMulti, Avro, Json, Parquet, and SequenceWritable (deprecated since 5.0.1).

Resolved Issues

PXF 6.0.0 resolves these issues:

Issue # | Summary
30987 | Resolves an issue where PXF returned an out of memory error while executing a query on a Hive table backed by a large number of files when it could not enlarge a string buffer during the fragmentation process. PXF 6.0.0 moves fragment distribution logic and fragment allocation to the PXF Service running on each segment host.

Deprecated Features

Deprecated features may be removed in a future major release of PXF. PXF version 6.x deprecates:

  • The PXF_FRAGMENTER_CACHE configuration property (deprecated since PXF version 6.0.0).
  • The pxf [cluster] init commands (deprecated since PXF version 6.0.0).
  • The pxf [cluster] reset commands (deprecated since PXF version 6.0.0).
  • The Hive profile names Hive, HiveText, HiveRC, HiveORC, and HiveVectorizedORC (deprecated since PXF version 6.0.0). Refer to Connectors, Data Formats, and Profiles in the PXF Hadoop documentation for the new profile names.
  • The HBase profile name (now hbase) (deprecated since PXF version 6.0.0).
  • The Jdbc profile name (now jdbc) (deprecated since PXF version 6.0.0).
  • Specifying a COMPRESSION_CODEC using the Java class name; use the codec short name instead.

Known Issues and Limitations

PXF 6.x has these known issues and limitations:

Issue # | Description
178013439 | (Resolved in 6.0.1) Using the deprecated HiveVectorizedORC profile does not result in vectorized execution. Workaround: Use the hive:orc profile with the option VECTORIZE=true.
31409 | (Resolved in 6.0.1) PXF can intermittently fail with the following error when it accesses Hive tables STORED AS ORC: ERROR: PXF server error(500) : Failed to initialize HiveResolver. Workaround: Use vectorized query execution by adding the VECTORIZE=true custom option to the LOCATION URL. (Note that PXF does not support predicate pushdown, complex types, or the timestamp data type with ORC vectorized execution.)
168957894 | The PXF Hive Connector does not support using the hive[:*] profiles to access Hive 3 managed (CRUD and insert-only transactional, and temporary) tables. Workaround: Use the PXF JDBC Connector to access Hive 3 managed tables.