Reading and Writing HDFS Text Data
The PXF HDFS Connector supports plain delimited and comma-separated value (CSV) format text data. This section describes how to use PXF to access HDFS text data, including how to create, query, and insert data into an external table that references files in the HDFS data store.
Prerequisites
Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read data from or write data to HDFS.
Reading Text Data
Use the hdfs:text profile when you read plain text delimited or .csv data where each row is a single record. The following syntax creates a Greenplum Database readable external table that references such a text file on HDFS:
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:text[&SERVER=<server_name>]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>');
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.
Keyword | Value |
---|---|
<path-to-hdfs-file> | The absolute path to the directory or file in the HDFS data store. |
PROFILE | The PROFILE keyword must specify hdfs:text. |
SERVER=<server_name> | The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified. |
FORMAT | Use FORMAT 'TEXT' when <path-to-hdfs-file> references plain text delimited data. Use FORMAT 'CSV' when <path-to-hdfs-file> references comma-separated value data. |
delimiter | The delimiter character in the data. For FORMAT 'CSV', the default <delim_value> is a comma (,). Preface the <delim_value> with an E when the value is an escape sequence. Examples: (delimiter=E'\t'), (delimiter ':'). |
Note: PXF does not support CSV files with a header row, nor does it support the (HEADER) formatter option in the CREATE EXTERNAL TABLE command.
Example: Reading Text Data on HDFS
Perform the following procedure to create a sample text file, copy the file to HDFS, and use the hdfs:text profile and the default PXF server to create two PXF external tables to query the data:
1. Create an HDFS directory for PXF example data files. For example:

$ hdfs dfs -mkdir -p /data/pxf_examples
2. Create a delimited plain text data file named pxf_hdfs_simple.txt:

$ echo 'Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67' > /tmp/pxf_hdfs_simple.txt

Note the use of the comma (,) to separate the four data fields.

3. Add the data file to HDFS:

$ hdfs dfs -put /tmp/pxf_hdfs_simple.txt /data/pxf_examples/
4. Display the contents of the pxf_hdfs_simple.txt file stored in HDFS:

$ hdfs dfs -cat /data/pxf_examples/pxf_hdfs_simple.txt
5. Start the psql subsystem:

$ psql -d postgres
6. Use the PXF hdfs:text profile to create a Greenplum Database external table that references the pxf_hdfs_simple.txt file that you just created and added to HDFS:

postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple(location text, month text, num_orders int, total_sales float8)
            LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text')
          FORMAT 'TEXT' (delimiter=E',');
7. Query the external table:

postgres=# SELECT * FROM pxf_hdfs_textsimple;

   location    | month | num_orders | total_sales
---------------+-------+------------+-------------
 Prague        | Jan   |        101 |     4875.33
 Rome          | Mar   |         87 |     1557.39
 Bangalore     | May   |        317 |     8936.99
 Beijing       | Jul   |        411 |    11600.67
(4 rows)
8. Create a second external table that references pxf_hdfs_simple.txt, this time specifying the CSV FORMAT:

postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple_csv(location text, month text, num_orders int, total_sales float8)
            LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text')
          FORMAT 'CSV';
postgres=# SELECT * FROM pxf_hdfs_textsimple_csv;
When you specify FORMAT 'CSV' for comma-separated value data, no delimiter formatter option is required because comma is the default delimiter value.
Reading Text Data with Quoted Linefeeds
Use the hdfs:text:multi profile to read plain text data with delimited single- or multi-line records that include embedded (quoted) linefeed characters. The following syntax creates a Greenplum Database readable external table that references such a text file on HDFS:
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:text:multi[&SERVER=<server_name>]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>');
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.
Keyword | Value |
---|---|
<path-to-hdfs-file> | The absolute path to the directory or file in the HDFS data store. |
PROFILE | The PROFILE keyword must specify hdfs:text:multi. |
SERVER=<server_name> | The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified. |
FORMAT | Use FORMAT 'TEXT' when <path-to-hdfs-file> references plain text delimited data. Use FORMAT 'CSV' when <path-to-hdfs-file> references comma-separated value data. |
delimiter | The delimiter character in the data. For FORMAT 'CSV', the default <delim_value> is a comma (,). Preface the <delim_value> with an E when the value is an escape sequence. Examples: (delimiter=E'\t'), (delimiter ':'). |
Example: Reading Multi-Line Text Data on HDFS
Perform the following steps to create a sample text file, copy the file to HDFS, and use the PXF hdfs:text:multi profile and the default PXF server to create a Greenplum Database readable external table to query the data:
1. Create a second delimited plain text file:

$ vi /tmp/pxf_hdfs_multi.txt
2. Copy/paste the following data into pxf_hdfs_multi.txt:

"4627 Star Rd.
San Francisco, CA 94107":Sept:2017
"113 Moon St.
San Diego, CA 92093":Jan:2018
"51 Belt Ct.
Denver, CO 90123":Dec:2016
"93114 Radial Rd.
Chicago, IL 60605":Jul:2017
"7301 Brookview Ave.
Columbus, OH 43213":Dec:2018

Notice the use of the colon (:) to separate the three fields. Also notice the quotes around the first (address) field. This field includes an embedded line feed separating the street address from the city and state.

3. Copy the text file to HDFS:

$ hdfs dfs -put /tmp/pxf_hdfs_multi.txt /data/pxf_examples/
4. Use the hdfs:text:multi profile to create an external table that references the pxf_hdfs_multi.txt HDFS file, making sure to identify the colon (:) as the field separator:

postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textmulti(address text, month text, year int)
            LOCATION ('pxf://data/pxf_examples/pxf_hdfs_multi.txt?PROFILE=hdfs:text:multi')
          FORMAT 'CSV' (delimiter ':');
Notice the alternate syntax for specifying the delimiter.

5. Query the pxf_hdfs_textmulti table:

postgres=# SELECT * FROM pxf_hdfs_textmulti;
         address          | month | year
--------------------------+-------+------
 4627 Star Rd.           +| Sept  | 2017
 San Francisco, CA 94107  |       |
 113 Moon St.            +| Jan   | 2018
 San Diego, CA 92093      |       |
 51 Belt Ct.             +| Dec   | 2016
 Denver, CO 90123         |       |
 93114 Radial Rd.        +| Jul   | 2017
 Chicago, IL 60605        |       |
 7301 Brookview Ave.     +| Dec   | 2018
 Columbus, OH 43213       |       |
(5 rows)
Writing Text Data to HDFS
The PXF HDFS connector hdfs:text profile supports writing single line plain text data to HDFS. When you create a writable external table with the PXF HDFS connector, you specify the name of a directory on HDFS. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified.
Note: External tables that you create with a writable profile can only be used for INSERT operations. If you want to query the data that you inserted, you must create a separate readable external table that references the HDFS directory.
Use the following syntax to create a Greenplum Database writable external table that references an HDFS directory:
CREATE WRITABLE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-dir>
?PROFILE=hdfs:text[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
The specific keywords and values used in the CREATE WRITABLE EXTERNAL TABLE command are described in the table below.
Keyword | Value |
---|---|
<path-to-hdfs-dir> | The absolute path to the directory in the HDFS data store. |
PROFILE | The PROFILE keyword must specify hdfs:text. |
SERVER=<server_name> | The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified. |
<custom-option> | <custom-option>s are described below. |
FORMAT | Use FORMAT 'TEXT' to write plain, delimited text to <path-to-hdfs-dir>. Use FORMAT 'CSV' to write comma-separated value text to <path-to-hdfs-dir>. |
delimiter | The delimiter character in the data. For FORMAT 'CSV', the default <delim_value> is a comma (,). Preface the <delim_value> with an E when the value is an escape sequence. Examples: (delimiter=E'\t'), (delimiter ':'). |
DISTRIBUTED BY | If you plan to load the writable external table with data from an existing Greenplum Database table, consider specifying the same distribution policy or <column_name> on the writable external table as that defined for the table from which you plan to load the data. Doing so will avoid extra motion of data between segments on the load operation. |
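For example, a minimal sketch of this pattern, assuming a hypothetical existing table named sales_fact that is distributed by its location column:

-- Existing Greenplum table, distributed by location (hypothetical)
CREATE TABLE sales_fact (location text, month text, num_orders int, total_sales float8)
DISTRIBUTED BY (location);

-- The writable external table mirrors the distribution policy so the
-- INSERT ... SELECT below avoids redistributing rows between segments
CREATE WRITABLE EXTERNAL TABLE sales_fact_ext (location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/sales_fact?PROFILE=hdfs:text')
FORMAT 'TEXT' (delimiter=',')
DISTRIBUTED BY (location);

INSERT INTO sales_fact_ext SELECT * FROM sales_fact;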
Writable external tables that you create using the hdfs:text profile can optionally use record or block compression. The PXF hdfs:text profile supports the following compression codecs:
- org.apache.hadoop.io.compress.DefaultCodec
- org.apache.hadoop.io.compress.GzipCodec
- org.apache.hadoop.io.compress.BZip2Codec
You specify the compression codec via custom options in the CREATE EXTERNAL TABLE LOCATION clause. The hdfs:text profile supports the following custom write options:
Option | Value Description |
---|---|
COMPRESSION_CODEC | The compression codec Java class name. If this option is not provided, Greenplum Database performs no data compression. Supported compression codecs include: org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec, and org.apache.hadoop.io.compress.GzipCodec. |
COMPRESSION_TYPE | The compression type to employ; supported values are RECORD (the default) or BLOCK. |
THREAD-SAFE | Boolean value determining if a table query can run in multi-threaded mode. The default value is TRUE . Set this option to FALSE to handle all requests in a single thread for operations that are not thread-safe (for example, compression). |
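For example, the following sketch (the directory and table names are hypothetical) combines these options to request block-compressed Gzip output and single-threaded request handling:

-- COMPRESSION_TYPE and THREAD-SAFE are optional; RECORD and TRUE are the defaults
CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_write_gzip (location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/pxfwritable_gzip?PROFILE=hdfs:text&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec&COMPRESSION_TYPE=BLOCK&THREAD-SAFE=FALSE')
FORMAT 'TEXT' (delimiter=',');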
Example: Writing Text Data to HDFS
This example utilizes the data schema introduced in Example: Reading Text Data on HDFS.
Column Name | Data Type |
---|---|
location | text |
month | text |
number_of_orders | int |
total_sales | float8 |
This example also optionally uses the Greenplum Database external table named pxf_hdfs_textsimple that you created in that exercise.
Procedure
Perform the following procedure to create Greenplum Database writable external tables utilizing the same data schema as described above, one of which will employ compression. You will use the PXF hdfs:text profile and the default PXF server to write data to the underlying HDFS directory. You will also create a separate, readable external table to read the data that you wrote to the HDFS directory.
1. Create a Greenplum Database writable external table utilizing the data schema described above. Write to the HDFS directory /data/pxf_examples/pxfwritable_hdfs_textsimple1. Create the table specifying a comma (,) as the delimiter:

postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_writabletbl_1(location text, month text, num_orders int, total_sales float8)
            LOCATION ('pxf://data/pxf_examples/pxfwritable_hdfs_textsimple1?PROFILE=hdfs:text')
          FORMAT 'TEXT' (delimiter=',');
You specify the FORMAT subclause delimiter value as the single ASCII comma character (,).

2. Write a few individual records to the pxfwritable_hdfs_textsimple1 HDFS directory by invoking the SQL INSERT command on pxf_hdfs_writabletbl_1:

postgres=# INSERT INTO pxf_hdfs_writabletbl_1 VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
postgres=# INSERT INTO pxf_hdfs_writabletbl_1 VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
3. (Optional) Insert the data from the pxf_hdfs_textsimple table that you created in Example: Reading Text Data on HDFS into pxf_hdfs_writabletbl_1:

postgres=# INSERT INTO pxf_hdfs_writabletbl_1 SELECT * FROM pxf_hdfs_textsimple;
4. In another terminal window, display the data that you just added to HDFS:

$ hdfs dfs -cat /data/pxf_examples/pxfwritable_hdfs_textsimple1/*
Frankfurt,Mar,777,3956.98
Cleveland,Oct,3812,96645.37
Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67
Because you specified comma (,) as the delimiter when you created the writable external table, this character is the field separator used in each record of the HDFS data.

5. Greenplum Database does not support directly querying a writable external table. To query the data that you just added to HDFS, you must create a readable external Greenplum Database table that references the HDFS directory:

postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple_r1(location text, month text, num_orders int, total_sales float8)
            LOCATION ('pxf://data/pxf_examples/pxfwritable_hdfs_textsimple1?PROFILE=hdfs:text')
          FORMAT 'CSV';
You specify the 'CSV' FORMAT when you create the readable external table because you created the writable table with a comma (,) as the delimiter character, the default delimiter for 'CSV' FORMAT.

6. Query the readable external table:

postgres=# SELECT * FROM pxf_hdfs_textsimple_r1 ORDER BY total_sales;

 location  | month | num_orders | total_sales
-----------+-------+------------+-------------
 Rome      | Mar   |         87 |     1557.39
 Frankfurt | Mar   |        777 |     3956.98
 Prague    | Jan   |        101 |     4875.33
 Bangalore | May   |        317 |     8936.99
 Beijing   | Jul   |        411 |    11600.67
 Cleveland | Oct   |       3812 |    96645.37
(6 rows)
The pxf_hdfs_textsimple_r1 table includes the records you individually inserted, as well as the full contents of the pxf_hdfs_textsimple table if you performed the optional step.

7. Create a second Greenplum Database writable external table, this time using Gzip compression and employing a colon (:) as the delimiter:

postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_writabletbl_2 (location text, month text, num_orders int, total_sales float8)
            LOCATION ('pxf://data/pxf_examples/pxfwritable_hdfs_textsimple2?PROFILE=hdfs:text&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
          FORMAT 'TEXT' (delimiter=':');
8. Write a few records to the pxfwritable_hdfs_textsimple2 HDFS directory by inserting directly into the pxf_hdfs_writabletbl_2 table:

postgres=# INSERT INTO pxf_hdfs_writabletbl_2 VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
postgres=# INSERT INTO pxf_hdfs_writabletbl_2 VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
9. In another terminal window, display the contents of the data that you added to HDFS; use the -text option to hdfs dfs to view the compressed data as text:

$ hdfs dfs -text /data/pxf_examples/pxfwritable_hdfs_textsimple2/*
Frankfurt:Mar:777:3956.98
Cleveland:Oct:3812:96645.37
Notice that the colon (:) is the field separator in this HDFS data.

10. To query data from the newly-created HDFS directory named pxfwritable_hdfs_textsimple2, create a readable external Greenplum Database table as described above that references this HDFS directory and specifies FORMAT 'CSV' (delimiter=':').
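For reference, a minimal sketch of that readable table and query (the table name pxf_hdfs_textsimple_r2 is illustrative):

postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple_r2(location text, month text, num_orders int, total_sales float8)
            LOCATION ('pxf://data/pxf_examples/pxfwritable_hdfs_textsimple2?PROFILE=hdfs:text')
          FORMAT 'CSV' (delimiter=':');
postgres=# SELECT * FROM pxf_hdfs_textsimple_r2;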