Reading and Writing HDFS Parquet Data
Use the PXF HDFS connector to read and write Parquet-format data. This section describes how to read and write HDFS files that are stored in Parquet format, including how to create, query, and insert into external tables that reference files in the HDFS data store.
PXF currently supports reading and writing primitive Parquet data types only.
Prerequisites
Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read data from or write data to HDFS.
Data Type Mapping
To read and write Parquet primitive data types in Greenplum Database, map Parquet data values to Greenplum Database columns of the same type. The following table summarizes the external mapping rules:
Parquet Data Type | PXF/Greenplum Data Type |
---|---|
boolean | Boolean |
byte_array | Bytea, Text |
double | Float8 |
fixed_len_byte_array | Numeric |
float | Real |
int_8, int_16 | Smallint, Integer |
int64 | Bigint |
int96 | Timestamp, Timestamptz |
Note:

- PXF localizes a `timestamp` to the current system time zone and converts it to universal time (UTC) before finally converting to `int96`.
- PXF converts a `timestamptz` to a UTC `timestamp` and then converts to `int96`. PXF loses the time zone information during this conversion.
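For instance, here is a minimal sketch of a readable external table whose Greenplum column types follow the mapping above; the table name, column names, and HDFS path are hypothetical:

```sql
-- Hypothetical example: one Greenplum column per Parquet primitive type,
-- chosen according to the mapping table above.
CREATE EXTERNAL TABLE parquet_types_example (
    is_active   boolean,    -- Parquet boolean
    raw_bytes   bytea,      -- Parquet byte_array
    price       float8,     -- Parquet double
    amount      numeric,    -- Parquet fixed_len_byte_array
    rating      real,       -- Parquet float
    tiny_count  smallint,   -- Parquet int_8/int_16
    big_count   bigint,     -- Parquet int64
    created_at  timestamp   -- Parquet int96
)
LOCATION ('pxf://data/example/parquet_types?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```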
Creating the External Table
The PXF HDFS connector `hdfs:parquet` profile supports reading and writing HDFS data in Parquet format. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified.
Use the following syntax to create a Greenplum Database external table that references an HDFS directory:
```
CREATE [WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-dir>?PROFILE=hdfs:parquet[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
```
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.
Keyword | Value |
---|---|
<path-to-hdfs-dir> | The absolute path to the directory in the HDFS data store. |
PROFILE | The PROFILE keyword must specify hdfs:parquet . |
SERVER=<server_name> | The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified. |
<custom-option>=<value> | <custom-option>s are described below. |
FORMAT 'CUSTOM' | Use FORMAT 'CUSTOM' with (FORMATTER='pxfwritable_export') (write) or (FORMATTER='pxfwritable_import') (read). |
DISTRIBUTED BY | If you plan to load the writable external table with data from an existing Greenplum Database table, consider specifying the same distribution policy or <column_name> on the writable external table as that defined for the table from which you plan to load the data. Doing so will avoid extra motion of data between segments on the load operation. |
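As an illustration of the DISTRIBUTED BY guidance in the table above, the following sketch (all table names and the HDFS path are hypothetical) gives the writable external table the same distribution column as the source table, so the load avoids extra motion of data between segments:

```sql
-- Hypothetical example: match the distribution policy of the source table.
CREATE TABLE sales_src (region text, amount float8)
    DISTRIBUTED BY (region);

CREATE WRITABLE EXTERNAL TABLE sales_ext (region text, amount float8)
    LOCATION ('pxf://data/example/sales?PROFILE=hdfs:parquet')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export')
    DISTRIBUTED BY (region);

-- Rows move straight from each segment to PXF with no redistribution.
INSERT INTO sales_ext SELECT * FROM sales_src;
```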
The PXF `hdfs:parquet` profile supports encoding- and compression-related write options. You specify these write options in the `CREATE WRITABLE EXTERNAL TABLE` `LOCATION` clause. The `hdfs:parquet` profile supports the following custom options:
Write Option | Value Description |
---|---|
COMPRESSION_CODEC | The compression codec alias. Supported compression codecs for writing Parquet data include: snappy , gzip , lzo , and uncompressed . If this option is not provided, PXF compresses the data using snappy compression. |
ROWGROUP_SIZE | A Parquet file consists of one or more row groups, a logical partitioning of the data into rows. ROWGROUP_SIZE identifies the size (in bytes) of the row group. The default row group size is 8 * 1024 * 1024 bytes. |
PAGE_SIZE | A row group consists of column chunks that are divided up into pages. PAGE_SIZE is the size (in bytes) of such a page. The default page size is 1024 * 1024 bytes. |
DICTIONARY_PAGE_SIZE | Dictionary encoding is enabled by default when PXF writes Parquet files. There is a single dictionary page per column, per row group. DICTIONARY_PAGE_SIZE is similar to PAGE_SIZE , but for the dictionary. The default dictionary page size is 512 * 1024 bytes. |
PARQUET_VERSION | The Parquet version; values v1 and v2 are supported. The default Parquet version is v1 . |
Note: You must explicitly specify `uncompressed` if you do not want PXF to compress the data.
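For example, here is a sketch of a writable external table that overrides the defaults with gzip compression and a 32 MB row group; the table name, columns, and HDFS path are illustrative:

```sql
-- Hypothetical example: write options appended to the LOCATION URI.
-- ROWGROUP_SIZE is specified in bytes (32 * 1024 * 1024 = 33554432).
CREATE WRITABLE EXTERNAL TABLE events_gzip (id bigint, payload text)
    LOCATION ('pxf://data/example/events?PROFILE=hdfs:parquet&COMPRESSION_CODEC=gzip&ROWGROUP_SIZE=33554432')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```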
Parquet files that you write to HDFS with PXF have the following naming format: `<file>.<compress_extension>.parquet`, for example `1547061635-0000004417_0.gz.parquet`.
Example
This example uses the data schema introduced in Example: Reading Text Data on HDFS.
Column Name | Data Type |
---|---|
location | text |
month | text |
number_of_orders | int |
total_sales | float8 |
In this example, you create a Parquet-format writable external table that uses the default PXF server to reference Parquet-format data in HDFS, insert some data into the table, and then create a readable external table to read the data.
1. Use the `hdfs:parquet` profile to create a writable external table. For example:

   ```sql
   postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_tbl_parquet (location text, month text, number_of_orders int, total_sales double precision)
                LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
              FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
   ```
2. Write a few records to the `pxf_parquet` HDFS directory by inserting directly into the `pxf_tbl_parquet` table. For example:

   ```sql
   postgres=# INSERT INTO pxf_tbl_parquet VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
   postgres=# INSERT INTO pxf_tbl_parquet VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
   ```
3. Recall that Greenplum Database does not support directly querying a writable external table. To read the data in `pxf_parquet`, create a readable external Greenplum Database table referencing this HDFS directory:

   ```sql
   postgres=# CREATE EXTERNAL TABLE read_pxf_parquet(location text, month text, number_of_orders int, total_sales double precision)
                LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
              FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
   ```
4. Query the readable external table `read_pxf_parquet`:

   ```sql
   postgres=# SELECT * FROM read_pxf_parquet ORDER BY total_sales;
   ```

   ```
    location  | month | number_of_orders | total_sales
   -----------+-------+------------------+-------------
    Frankfurt | Mar   |              777 |     3956.98
    Cleveland | Oct   |             3812 |     96645.4
   (2 rows)
   ```
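When you are done with this example, you can optionally drop both external tables, as in the sketch below. Dropping an external table removes only the table definition in Greenplum Database, not the Parquet files that PXF wrote to HDFS:

```sql
-- Optional cleanup: remove the external table definitions.
-- The Parquet files under /data/pxf_examples/pxf_parquet remain in HDFS.
DROP EXTERNAL TABLE read_pxf_parquet;
DROP EXTERNAL TABLE pxf_tbl_parquet;
```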