Reading and Writing HDFS Parquet Data

Use the PXF HDFS connector to read and write Parquet-format data. This section describes how to read and write HDFS files that are stored in Parquet format, including how to create, query, and insert into an external table that references files in the HDFS data store.

PXF currently supports reading and writing primitive Parquet data types only.

PXF Parquet write support is a Beta feature.

Prerequisites

Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read data from or write data to HDFS.

Data Type Mapping

To read and write Parquet primitive data types in Greenplum Database, map Parquet data values to Greenplum Database columns of the same type. The following table summarizes the external mapping rules:

Parquet Data Type      PXF/Greenplum Data Type
boolean                Boolean
byte_array             Bytea, Text
double                 Float8
fixed_len_byte_array   Numeric
float                  Real
int_8, int_16          Smallint, Integer
int64                  Bigint
int96                  Timestamp
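
For example, the following sketch declares a readable external table whose column types follow the mappings above; the table name, column names, and HDFS path are illustrative only, and the full CREATE EXTERNAL TABLE syntax is described in the next section:

    CREATE EXTERNAL TABLE parquet_types_example (
        order_id   bigint,     -- Parquet int64
        price      numeric,    -- Parquet fixed_len_byte_array
        in_stock   boolean,    -- Parquet boolean
        sold_at    timestamp   -- Parquet int96
    )
    LOCATION ('pxf://data/pxf_examples/parquet_types?PROFILE=hdfs:parquet')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');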

Creating the External Table

The PXF HDFS connector hdfs:parquet profile supports reading and writing HDFS data in Parquet format. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified.

Use the following syntax to create a Greenplum Database external table that references an HDFS directory:

CREATE [WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-dir>
    ?PROFILE=hdfs:parquet[&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];

The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.

Keyword Value
<path-to-hdfs-dir> The absolute path to the directory in the HDFS data store.
PROFILE The PROFILE keyword must specify hdfs:parquet.
<custom-option>=<value> <custom-option>s are described below.
FORMAT 'CUSTOM' Use FORMAT 'CUSTOM' with (FORMATTER='pxfwritable_export') (write) or (FORMATTER='pxfwritable_import') (read).
DISTRIBUTED BY If you plan to load the writable external table with data from an existing Greenplum Database table, consider specifying the same distribution policy or <column_name> on the writable external table as that defined for the table from which you plan to load the data. Doing so will avoid extra motion of data between segments on the load operation.
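
The following sketch illustrates the DISTRIBUTED BY recommendation. It assumes a hypothetical source table named sales_monthly distributed by its location column; because the writable external table declares the same distribution key, the INSERT ... SELECT load avoids redistributing rows between segments:

    -- Hypothetical source table, distributed by location
    CREATE TABLE sales_monthly (location text, month text, number_of_orders int, total_sales float8)
        DISTRIBUTED BY (location);

    -- Writable external table with the same columns and distribution key
    CREATE WRITABLE EXTERNAL TABLE pxf_sales_out (LIKE sales_monthly)
        LOCATION ('pxf://data/pxf_examples/sales_out?PROFILE=hdfs:parquet')
        FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export')
        DISTRIBUTED BY (location);

    -- Load the writable external table from the source table
    INSERT INTO pxf_sales_out SELECT * FROM sales_monthly;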

The PXF hdfs:parquet profile supports write compression. You specify the compression codec via a custom option in the CREATE EXTERNAL TABLE LOCATION clause. The hdfs:parquet profile supports the following custom options:

Option              Value Description
COMPRESSION_CODEC   The compression codec alias. Supported compression codecs for writing Parquet data include: snappy, gzip, lzo, and uncompressed. If this option is not provided, PXF compresses the data using snappy compression.

Note: You must explicitly specify uncompressed if you do not want PXF to compress the data.

Parquet files that you write to HDFS have the following naming format: <file>.<compress_extension>.parquet, for example 1547061635-0000004417_0.gz.parquet.
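
For example, the following sketch creates a writable external table that writes gzip-compressed Parquet files; the table name and HDFS directory are illustrative. Files written through this table would use the .gz.parquet suffix described above:

    CREATE WRITABLE EXTERNAL TABLE pxf_tbl_parquet_gzip (location text, month text, number_of_orders int, total_sales float8)
        LOCATION ('pxf://data/pxf_examples/pxf_parquet_gzip?PROFILE=hdfs:parquet&COMPRESSION_CODEC=gzip')
        FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');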

Example

This example utilizes the data schema introduced in Example: Reading Text Data on HDFS.

Column Name        Data Type
location           text
month              text
number_of_orders   int
total_sales        float8

In this example, you create a Parquet-format writable external table that references Parquet-format data in HDFS, insert some data into the table, and then create a readable external table to read the data.

  1. Use the hdfs:parquet profile to create a writable external table. For example:

    postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_tbl_parquet (location text, month text, number_of_orders int, total_sales double precision)
        LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
      FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
    
  2. Write a few records to the pxf_parquet HDFS directory by inserting directly into the pxf_tbl_parquet table. For example:

    postgres=# INSERT INTO pxf_tbl_parquet VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
    postgres=# INSERT INTO pxf_tbl_parquet VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
    
  3. Recall that Greenplum Database does not support directly querying a writable external table. To read the data in pxf_parquet, create a readable external Greenplum Database table referencing this HDFS directory:

    postgres=# CREATE EXTERNAL TABLE read_pxf_parquet(location text, month text, number_of_orders int, total_sales double precision)
        LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
        FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
    
  4. Query the readable external table read_pxf_parquet:

    postgres=# SELECT * FROM read_pxf_parquet ORDER BY total_sales;
    
     location  | month | number_of_orders | total_sales
    -----------+-------+------------------+-------------
     Frankfurt | Mar   |              777 |     3956.98
     Cleveland | Oct   |             3812 |     96645.4
    (2 rows)
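
If you want to work with this data in a native Greenplum Database table, one option is to load it from the readable external table. The following sketch uses a hypothetical target table name; the DISTRIBUTED BY clause is optional:

    postgres=# CREATE TABLE parquet_orders AS SELECT * FROM read_pxf_parquet DISTRIBUTED BY (location);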