Accessing Hive Table Data with PXF
- Prerequisites
- Hive Data Formats
- Data Type Mapping
- Sample Data Set
- Hive Command Line
- Querying External Hive Data
- Accessing TextFile-Format Hive Tables
- Accessing RCFile-Format Hive Tables
- Accessing ORC-Format Hive Tables
- Accessing Parquet-Format Hive Tables
- Working with Complex Data Types
- Accessing Heterogeneous Partitioned Data
Apache Hive is a distributed data warehousing infrastructure. Hive facilitates managing large data sets and supports multiple data formats, including comma-separated value (.csv) TextFile, RCFile, ORC, and Parquet.
The PXF Hive connector reads data stored in a Hive table. This section describes how to use the PXF Hive connector.
Prerequisites
Before working with Hive table data using PXF, ensure that:
- You have installed and configured a Hive client on each Greenplum Database segment host. Refer to Installing and Configuring Hadoop Clients for PXF for instructions.
- You have initialized PXF on your Greenplum Database segment hosts, and PXF is running on each host. See Configuring, Initializing, and Managing PXF for PXF initialization, configuration, and startup information.
Hive Data Formats
The PXF Hive connector supports several data formats, and has defined the following profiles for accessing these formats:
File Format | Description | Profile |
---|---|---|
TextFile | Flat file with data in comma-, tab-, or space-separated value format or JSON notation. | Hive, HiveText |
SequenceFile | Flat file consisting of binary key/value pairs. | Hive |
RCFile | Record columnar data consisting of binary key/value pairs; high row compression rate. | Hive, HiveRC |
ORC | Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. | Hive, HiveORC, HiveVectorizedORC |
Parquet | Compressed columnar data representation. | Hive |
Note: The Hive profile supports all file storage formats. It will use the optimal Hive* profile for the underlying file format type.
Data Type Mapping
The PXF Hive connector supports primitive and complex data types.
Primitive Data Types
To represent Hive data in Greenplum Database, map data values that use a primitive data type to Greenplum Database columns of the same type.
The following table summarizes external mapping rules for Hive primitive types.
Hive Data Type | Greenplum Data Type |
---|---|
boolean | bool |
int | int4 |
smallint | int2 |
tinyint | int2 |
bigint | int8 |
float | float4 |
double | float8 |
string | text |
binary | bytea |
timestamp | timestamp |
Note: The HiveVectorizedORC profile does not support the timestamp data type.
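For example, a Hive table that exercises several of these primitive types (a hypothetical table named type_demo, shown only to illustrate the mapping):

hive> CREATE TABLE type_demo (flag boolean, qty smallint, total bigint, price float, label string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS textfile;

would map to a Greenplum Database external table declared with the equivalent types from the table above (using the Hive profile syntax described later in this topic):

postgres=# CREATE EXTERNAL TABLE pxf_type_demo(flag bool, qty int2, total int8, price float4, label text) LOCATION ('pxf://default.type_demo?PROFILE=Hive') FORMAT 'CUSTOM' (formatter='pxfwritable_import');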
Complex Data Types
Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to text. You can create Greenplum Database functions or application code to extract subcomponents of these complex data types.
Examples using complex data types with the Hive and HiveORC profiles are provided later in this topic.
Note: The HiveVectorizedORC profile does not support complex types.
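As a minimal sketch of such extraction code (this assumes the JSON-like text serialization shown in the complex type examples later in this topic; the function name hive_map_get is illustrative, not part of PXF):

postgres=# CREATE FUNCTION hive_map_get(maptext text, key text) RETURNS text AS $$ SELECT substring($1 from '"' || $2 || '":"([^"]*)"') $$ LANGUAGE SQL IMMUTABLE;

With this function, hive_map_get('{"zone":"euro","status":"up"}', 'zone') returns euro.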
Sample Data Set
Examples presented in this topic operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
Field Name | Data Type |
---|---|
location | text |
month | text |
number_of_orders | integer |
total_sales | double |
Prepare the sample data set for use:
First, create a text file:
$ vi /tmp/pxf_hive_datafile.txt
Add the following data to pxf_hive_datafile.txt; notice the use of the comma (,) to separate the four field values:
Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67
San Francisco,Sept,156,6846.34
Paris,Nov,159,7134.56
San Francisco,Jan,113,5397.89
Prague,Dec,333,9894.77
Bangalore,Jul,271,8320.55
Beijing,Dec,100,4248.41
Make note of the path to pxf_hive_datafile.txt; you will use it in later exercises.
Hive Command Line
The Hive command line is a subsystem similar to that of psql. To start the Hive command line:
$ HADOOP_USER_NAME=hdfs hive
The default Hive database is named default.
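To confirm the databases available to you (an optional check), list them from the Hive command line:

hive> SHOW DATABASES;
OK
default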
Example: Creating a Hive Table
Create a Hive table to expose the sample data set.
Create a Hive table named sales_info in the default database:
hive> CREATE TABLE sales_info (location string, month string, number_of_orders int, total_sales double) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS textfile;
Notice that:
- The STORED AS textfile subclause instructs Hive to create the table in Textfile (the default) format. Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
- The DELIMITED FIELDS TERMINATED BY subclause identifies the field delimiter within a data record (line). The sales_info table field delimiter is a comma (,).
Load the pxf_hive_datafile.txt sample data file into the sales_info table you just created:
hive> LOAD DATA LOCAL INPATH '/tmp/pxf_hive_datafile.txt' INTO TABLE sales_info;
In examples later in this section, you will access the sales_info Hive table directly via PXF. You will also insert sales_info data into tables of other Hive file format types, and use PXF to access those directly as well.
Perform a query on sales_info to verify that you loaded the data successfully:
hive> SELECT * FROM sales_info;
Determining the HDFS Location of a Hive Table
Should you need to identify the HDFS file location of a Hive managed table, reference it using its HDFS file path. You can determine a Hive table’s location in HDFS using the DESCRIBE command. For example:
hive> DESCRIBE EXTENDED sales_info;
Detailed Table Information
...
location:hdfs://<namenode>:<port>/apps/hive/warehouse/sales_info
...
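Should you want to inspect the files behind the table (an optional check; the path below is the example location reported above), list the directory with the HDFS shell:

$ HADOOP_USER_NAME=hdfs hdfs dfs -ls /apps/hive/warehouse/sales_info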
Querying External Hive Data
You can create a Greenplum Database external table to access Hive table data. As described previously, the PXF Hive connector defines specific profiles to support different file formats. These profiles are named Hive, HiveText, HiveRC, HiveORC, and HiveVectorizedORC.
The HiveText and HiveRC profiles are specifically optimized for text and RCFile formats, respectively. The HiveORC and HiveVectorizedORC profiles are optimized for ORC file formats. The Hive profile is optimized for all file storage types; you can use the Hive profile when the underlying Hive table is composed of multiple partitions with differing file formats.
Use the following syntax to create a Greenplum Database external table referencing a Hive table:
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<hive-db-name>.<hive-table-name>
    ?PROFILE=Hive|HiveText|HiveRC|HiveORC|HiveVectorizedORC[&DELIMITER=<delim>]')
FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
Hive connector-specific keywords and values used in the CREATE EXTERNAL TABLE call are described below.
Keyword | Value |
---|---|
<hive-db-name> | The name of the Hive database. If omitted, defaults to the Hive database named default. |
<hive-table-name> | The name of the Hive table. |
PROFILE | The PROFILE keyword must specify one of the values Hive, HiveText, HiveRC, HiveORC, or HiveVectorizedORC. |
DELIMITER | The DELIMITER clause is required for both the HiveText and HiveRC profiles and identifies the field delimiter used in the Hive data set. <delim> must be a single ASCII character or specified in hexadecimal representation. |
FORMAT (Hive, HiveORC, and HiveVectorizedORC profiles) | The FORMAT clause must specify CUSTOM. The CUSTOM format requires the built-in pxfwritable_import formatter. |
FORMAT (HiveText and HiveRC profiles) | The FORMAT clause must specify TEXT. The delimiter must be specified a second time in the '<delim>' formatting option. |
Accessing TextFile-Format Hive Tables
You can use the Hive and HiveText profiles to access Hive table data stored in TextFile format.
Note: When you use the HiveText profile, you must specify a delimiter option in both the LOCATION and FORMAT clauses.
Example: Using the Hive Profile
Use the Hive profile to create a readable Greenplum Database external table referencing the Hive sales_info textfile format table you created earlier.
Create the external table:
postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://default.sales_info?PROFILE=Hive') FORMAT 'custom' (formatter='pxfwritable_import');
Query the table:
postgres=# SELECT * FROM salesinfo_hiveprofile;
   location    | month | num_orders | total_sales
---------------+-------+------------+-------------
 Prague        | Jan   |        101 |     4875.33
 Rome          | Mar   |         87 |     1557.39
 Bangalore     | May   |        317 |     8936.99
 ...
Example: Using the HiveText Profile
Use the PXF HiveText profile to create a readable Greenplum Database external table from the Hive sales_info textfile format table you created earlier.
Create the external table:
postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://default.sales_info?PROFILE=HiveText&DELIMITER=\x2c') FORMAT 'TEXT' (delimiter=E',');
Notice that:
- The LOCATION subclause DELIMITER value is specified in hexadecimal format. The \x prefix instructs PXF to interpret the following characters as hexadecimal. 2c is the hex value for the comma character.
- The FORMAT subclause delimiter value is specified as the single ASCII comma character ','. E escapes the character.
Query the external table:
postgres=# SELECT * FROM salesinfo_hivetextprofile WHERE location='Beijing';
 location | month | num_orders | total_sales
----------+-------+------------+-------------
 Beijing  | Jul   |        411 |    11600.67
 Beijing  | Dec   |        100 |     4248.41
(2 rows)
Accessing RCFile-Format Hive Tables
The RCFile Hive table format is used for row columnar formatted data. The PXF HiveRC profile provides access to RCFile data.
Note: When you use the HiveRC profile, you must specify a delimiter option in both the LOCATION and FORMAT clauses.
Example: Using the HiveRC Profile
Use the HiveRC profile to query RCFile-formatted data in a Hive table.
Start the hive command line and create a Hive table stored in RCFile format:
$ HADOOP_USER_NAME=hdfs hive
hive> CREATE TABLE sales_info_rcfile (location string, month string, number_of_orders int, total_sales double) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS rcfile;
Insert the data from the sales_info table into sales_info_rcfile:
hive> INSERT INTO TABLE sales_info_rcfile SELECT * FROM sales_info;
A copy of the sample data set is now stored in RCFile format in the Hive sales_info_rcfile table.
Query the sales_info_rcfile Hive table to verify that the data was loaded correctly:
hive> SELECT * FROM sales_info_rcfile;
Use the PXF HiveRC profile to create a readable Greenplum Database external table referencing the Hive sales_info_rcfile table you created in the previous steps. You must specify a delimiter option in both the LOCATION and FORMAT clauses:
postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://default.sales_info_rcfile?PROFILE=HiveRC&DELIMITER=\x2c') FORMAT 'TEXT' (delimiter=E',');
Query the external table:
postgres=# SELECT location, total_sales FROM salesinfo_hivercprofile;
   location    | total_sales
---------------+-------------
 Prague        |     4875.33
 Rome          |     1557.39
 Bangalore     |     8936.99
 Beijing       |    11600.67
 ...
Accessing ORC-Format Hive Tables
The Optimized Row Columnar (ORC) file format is a columnar file format that provides a highly efficient way to both store and access HDFS data. ORC format offers improvements over text and RCFile formats in terms of both compression and performance. PXF supports ORC version 1.2.1.
ORC is type-aware and specifically designed for Hadoop workloads. ORC files store both the type of, and encoding information for, the data in the file. All columns within a single group of row data (also known as a stripe) are stored together on disk in ORC format files. The columnar nature of the ORC format type enables read projection, helping avoid accessing unnecessary columns during a query.
ORC also supports predicate pushdown with built-in indexes at the file, stripe, and row levels, moving the filter operation to the data loading phase.
Refer to the Apache ORC and the Apache Hive LanguageManual ORC websites for detailed information about the ORC file format.
Profiles Supporting the ORC File Format
When choosing an ORC-supporting profile, consider the following:
The HiveORC profile:
- Reads a single row of data at a time.
- Supports column projection.
- Supports complex types. You can access Hive tables composed of array, map, struct, and union data types. PXF serializes each of these complex types to text.
The HiveVectorizedORC profile:
- Reads up to 1024 rows of data at once.
- Does not support column projection.
- Does not support complex types or the timestamp data type.
Note: The HiveORC and HiveVectorizedORC profiles do not currently support predicate pushdown.
Example: Using the HiveORC Profile
In the following example, you will create a Hive table stored in ORC format and use the HiveORC profile to query this Hive table.
Create a Hive table with ORC file format:
$ HADOOP_USER_NAME=hdfs hive
hive> CREATE TABLE sales_info_ORC (location string, month string, number_of_orders int, total_sales double) STORED AS ORC;
Insert the data from the sales_info table into sales_info_ORC:
hive> INSERT INTO TABLE sales_info_ORC SELECT * FROM sales_info;
A copy of the sample data set is now stored in ORC format in sales_info_ORC.
Perform a Hive query on sales_info_ORC to verify that the data was loaded successfully:
hive> SELECT * FROM sales_info_ORC;
Start the psql subsystem and turn on timing:
$ psql -d postgres
postgres=> \timing
Timing is on.
Use the PXF HiveORC profile to create a Greenplum Database external table referencing the Hive sales_info_ORC table that you created earlier. The FORMAT clause must specify 'CUSTOM'. The HiveORC CUSTOM format supports only the built-in 'pxfwritable_import' formatter.
postgres=> CREATE EXTERNAL TABLE salesinfo_hiveORCprofile(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://default.sales_info_ORC?PROFILE=HiveORC') FORMAT 'CUSTOM' (formatter='pxfwritable_import');
Query the external table:
postgres=> SELECT * FROM salesinfo_hiveORCprofile;
   location    | month | num_orders | total_sales
---------------+-------+------------+-------------
 Prague        | Jan   |        101 |     4875.33
 Rome          | Mar   |         87 |     1557.39
 Bangalore     | May   |        317 |     8936.99
 ...
Time: 425.416 ms
Example: Using the HiveVectorizedORC Profile
In the following example, you will use the HiveVectorizedORC profile to query the sales_info_ORC Hive table you created in the previous example.
Start the psql subsystem:
$ psql -d postgres
Use the PXF HiveVectorizedORC profile to create a readable Greenplum Database external table referencing the Hive sales_info_ORC table that you created in the previous example. The FORMAT clause must specify 'CUSTOM'. The HiveVectorizedORC CUSTOM format supports only the built-in 'pxfwritable_import' formatter.
postgres=> CREATE EXTERNAL TABLE salesinfo_hiveVectORC(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://default.sales_info_ORC?PROFILE=HiveVectorizedORC') FORMAT 'CUSTOM' (formatter='pxfwritable_import');
Query the external table:
postgres=> SELECT * FROM salesinfo_hiveVectORC;
   location    | month | num_orders | total_sales
---------------+-------+------------+-------------
 Prague        | Jan   |        101 |     4875.33
 Rome          | Mar   |         87 |     1557.39
 Bangalore     | May   |        317 |     8936.99
 ...
Time: 425.416 ms
Accessing Parquet-Format Hive Tables
The PXF Hive profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format. Map the table columns using equivalent Greenplum Database data types. For example, if a Hive table is created in the default schema using:
hive> CREATE TABLE hive_parquet_table (location string, month string,
number_of_orders int, total_sales double)
STORED AS parquet;
Define the Greenplum Database external table:
postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (location text, month text, number_of_orders int, total_sales double precision)
LOCATION ('pxf://default.hive_parquet_table?profile=Hive')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
And query the table:
postgres=# SELECT month, number_of_orders FROM pxf_parquet_table;
Working with Complex Data Types
Example: Using the Hive Profile with Complex Data Types
This example employs the Hive profile and the array and map complex types, specifically an array of integers and a string key/value pair map.
The data schema for this example includes fields with the following names and data types:
Field Name | Data Type |
---|---|
index | int |
name | string |
intarray | array of integers |
propmap | map of string key and value pairs |
When specifying an array field in a Hive table, you must identify the terminator for each item in the collection. Similarly, you must also specify the map key termination character.
Create a text file from which you will load the data set:
$ vi /tmp/pxf_hive_complex.txt
Add the following text to pxf_hive_complex.txt. This data uses a comma (,) to separate field values, the percent symbol (%) to separate collection items, and a colon (:) to terminate map key values:
3,Prague,1%2%3,zone:euro%status:up
89,Rome,4%5%6,zone:euro
400,Bangalore,7%8%9,zone:apac%status:pending
183,Beijing,0%1%2,zone:apac
94,Sacramento,3%4%5,zone:noam%status:down
101,Paris,6%7%8,zone:euro%status:up
56,Frankfurt,9%0%1,zone:euro
202,Jakarta,2%3%4,zone:apac%status:up
313,Sydney,5%6%7,zone:apac%status:pending
76,Atlanta,8%9%0,zone:noam%status:down
Create a Hive table to represent this data:
$ HADOOP_USER_NAME=hdfs hive
hive> CREATE TABLE table_complextypes( index int, name string, intarray ARRAY<int>, propmap MAP<string, string>) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '%' MAP KEYS TERMINATED BY ':' STORED AS TEXTFILE;
Notice that:
- The FIELDS TERMINATED BY subclause identifies a comma as the field terminator.
- The COLLECTION ITEMS TERMINATED BY subclause specifies the percent sign as the collection items (array item, map key/value pair) terminator.
- The MAP KEYS TERMINATED BY subclause identifies a colon as the terminator for map keys.
Load the pxf_hive_complex.txt sample data file into the table_complextypes table you just created:
hive> LOAD DATA LOCAL INPATH '/tmp/pxf_hive_complex.txt' INTO TABLE table_complextypes;
Perform a query on Hive table table_complextypes to verify that the data was loaded successfully:
hive> SELECT * FROM table_complextypes;
3     Prague     [1,2,3]   {"zone":"euro","status":"up"}
89    Rome       [4,5,6]   {"zone":"euro"}
400   Bangalore  [7,8,9]   {"zone":"apac","status":"pending"}
...
Use the PXF Hive profile to create a readable Greenplum Database external table referencing the Hive table named table_complextypes:
postgres=# CREATE EXTERNAL TABLE complextypes_hiveprofile(index int, name text, intarray text, propmap text) LOCATION ('pxf://table_complextypes?PROFILE=Hive') FORMAT 'CUSTOM' (formatter='pxfwritable_import');
Notice that the integer array and map complex types are mapped to Greenplum Database data type text.
Query the external table:
postgres=# SELECT * FROM complextypes_hiveprofile;
 index | name       | intarray | propmap
-------+------------+----------+------------------------------------
     3 | Prague     | [1,2,3]  | {"zone":"euro","status":"up"}
    89 | Rome       | [4,5,6]  | {"zone":"euro"}
   400 | Bangalore  | [7,8,9]  | {"zone":"apac","status":"pending"}
   183 | Beijing    | [0,1,2]  | {"zone":"apac"}
    94 | Sacramento | [3,4,5]  | {"zone":"noam","status":"down"}
   101 | Paris      | [6,7,8]  | {"zone":"euro","status":"up"}
    56 | Frankfurt  | [9,0,1]  | {"zone":"euro"}
   202 | Jakarta    | [2,3,4]  | {"zone":"apac","status":"up"}
   313 | Sydney     | [5,6,7]  | {"zone":"apac","status":"pending"}
    76 | Atlanta    | [8,9,0]  | {"zone":"noam","status":"down"}
(10 rows)
intarray and propmap are each serialized as text strings.
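Should you need the subcomponents in native form, a minimal sketch (assuming the serialized formats shown in the output above) converts intarray to a Greenplum Database array and extracts one propmap key with standard string functions:

postgres=# SELECT name, translate(intarray, '[]', '{}')::int[] AS int_arr, substring(propmap from '"zone":"([^"]*)"') AS zone FROM complextypes_hiveprofile;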
Example: Using the HiveORC Profile with Complex Data Types
In the following example, you will create and populate a Hive table stored in ORC format. You will use the HiveORC profile to query the complex types in this Hive table.
Create a Hive table with ORC storage format:
$ HADOOP_USER_NAME=hdfs hive
hive> CREATE TABLE table_complextypes_ORC( index int, name string, intarray ARRAY<int>, propmap MAP<string, string>) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '%' MAP KEYS TERMINATED BY ':' STORED AS ORC;
Insert the data from the table_complextypes table you created in the previous example into table_complextypes_ORC:
hive> INSERT INTO TABLE table_complextypes_ORC SELECT * FROM table_complextypes;
A copy of the sample data set is now stored in ORC format in table_complextypes_ORC.
Perform a Hive query on table_complextypes_ORC to verify that the data was loaded successfully:
hive> SELECT * FROM table_complextypes_ORC;
OK
3     Prague     [1,2,3]   {"zone":"euro","status":"up"}
89    Rome       [4,5,6]   {"zone":"euro"}
400   Bangalore  [7,8,9]   {"zone":"apac","status":"pending"}
...
Start the psql subsystem:
$ psql -d postgres
Use the PXF HiveORC profile to create a readable Greenplum Database external table from the Hive table named table_complextypes_ORC you created above. The FORMAT clause must specify 'CUSTOM'. The HiveORC CUSTOM format supports only the built-in 'pxfwritable_import' formatter.
postgres=> CREATE EXTERNAL TABLE complextypes_hiveorc(index int, name text, intarray text, propmap text) LOCATION ('pxf://default.table_complextypes_ORC?PROFILE=HiveORC') FORMAT 'CUSTOM' (formatter='pxfwritable_import');
Notice that the integer array and map complex types are again mapped to Greenplum Database data type text.
Query the external table:
postgres=> SELECT * FROM complextypes_hiveorc;
 index | name      | intarray | propmap
-------+-----------+----------+------------------------------------
     3 | Prague    | [1,2,3]  | {"zone":"euro","status":"up"}
    89 | Rome      | [4,5,6]  | {"zone":"euro"}
   400 | Bangalore | [7,8,9]  | {"zone":"apac","status":"pending"}
 ...
intarray and propmap are again serialized as text strings.
Accessing Heterogeneous Partitioned Data
You can use the PXF Hive profile with any Hive file storage type. With the Hive profile, you can access heterogeneous format data in a single Hive table where the partitions may be stored in different file formats.
Example: Using the Hive Profile with Heterogeneous Data
In this example, you create a partitioned Hive external table. The table is composed of the HDFS data files associated with the sales_info (text format) and sales_info_rcfile (RC format) Hive tables you created in previous exercises. You will partition the data by year, assigning the data from sales_info to the year 2013, and the data from sales_info_rcfile to the year 2016. (Ignore at the moment the fact that the tables contain the same data.) You will then use the PXF Hive profile to query this partitioned Hive external table.
Create a Hive external table named hive_multiformpart that is partitioned by a string field named year:
$ HADOOP_USER_NAME=hdfs hive
hive> CREATE EXTERNAL TABLE hive_multiformpart( location string, month string, number_of_orders int, total_sales double) PARTITIONED BY( year string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Describe the sales_info and sales_info_rcfile tables, noting the HDFS file location for each table:
hive> DESCRIBE EXTENDED sales_info;
hive> DESCRIBE EXTENDED sales_info_rcfile;
Create partitions in the hive_multiformpart table for the HDFS file locations associated with each of the sales_info and sales_info_rcfile tables:
hive> ALTER TABLE hive_multiformpart ADD PARTITION (year = '2013') LOCATION 'hdfs://namenode:8020/apps/hive/warehouse/sales_info';
hive> ALTER TABLE hive_multiformpart ADD PARTITION (year = '2016') LOCATION 'hdfs://namenode:8020/apps/hive/warehouse/sales_info_rcfile';
Explicitly identify the file format of the partition associated with the sales_info_rcfile table:
hive> ALTER TABLE hive_multiformpart PARTITION (year='2016') SET FILEFORMAT RCFILE;
You need not specify the file format of the partition associated with the sales_info table, as TEXTFILE format is the default.
Query the hive_multiformpart table:
hive> SELECT * from hive_multiformpart;
...
Bangalore   Jul    271   8320.55    2016
Beijing     Dec    100   4248.41    2016
Prague      Jan    101   4875.33    2013
Rome        Mar    87    1557.39    2013
...
hive> SELECT * from hive_multiformpart WHERE year='2013';
hive> SELECT * from hive_multiformpart WHERE year='2016';
Show the partitions defined for the hive_multiformpart table and exit hive:
hive> SHOW PARTITIONS hive_multiformpart;
year=2013
year=2016
hive> quit;
Start the psql subsystem:
$ psql -d postgres
Use the PXF Hive profile to create a readable Greenplum Database external table referencing the Hive hive_multiformpart external table you created in the previous steps:
postgres=# CREATE EXTERNAL TABLE pxf_multiformpart(location text, month text, num_orders int, total_sales float8, year text) LOCATION ('pxf://default.hive_multiformpart?PROFILE=Hive') FORMAT 'CUSTOM' (formatter='pxfwritable_import');
Query the PXF external table:
postgres=# SELECT * FROM pxf_multiformpart;
   location    | month | num_orders | total_sales | year
---------------+-------+------------+-------------+------
 ....
 Prague        | Dec   |        333 |     9894.77 | 2013
 Bangalore     | Jul   |        271 |     8320.55 | 2013
 Beijing       | Dec   |        100 |     4248.41 | 2013
 Prague        | Jan   |        101 |     4875.33 | 2016
 Rome          | Mar   |         87 |     1557.39 | 2016
 Bangalore     | May   |        317 |     8936.99 | 2016
 ....
Perform a second query to calculate the total number of December orders for the year 2013:
postgres=# SELECT sum(num_orders) FROM pxf_multiformpart WHERE month='Dec' AND year='2013';
 sum
-----
 433
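Because the year partition column is exposed as an ordinary column of the external table, you can also aggregate across the differently-formatted partitions in a single statement; for example (an illustrative query against the same table):

postgres=# SELECT year, sum(total_sales) FROM pxf_multiformpart GROUP BY year;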