Accessing HBase Table Data with PXF

A newer version of this documentation is available. Click here to view the most up-to-date release of the Greenplum 5.x documentation.

Apache HBase is a distributed, versioned, non-relational database on Hadoop.

The PXF HBase connector reads data stored in an HBase table. This section describes how to use the PXF HBase connector.

Prerequisites

Before working with HBase table data, ensure that you have:

HBase Primer

This topic assumes that you have a basic understanding of the following HBase concepts:

  • An HBase column includes two components: a column family and a column qualifier. These components are delimited by a colon : character, <column-family>:<column-qualifier>.
  • An HBase row consists of a row key and one or more column values. A row key is a unique identifier for the table row.
  • An HBase table is a multi-dimensional map comprised of one or more columns and rows of data. You specify the complete set of column families when you create an HBase table.
  • An HBase cell is comprised of a row (column family, column qualifier, column value) and a timestamp. The column value and timestamp in a given cell represent a version of the value.

For detailed information about HBase, refer to the Apache HBase Reference Guide.

HBase Shell

The HBase shell is a subsystem similar to that of psql. To start the HBase shell:

$ hbase shell
<hbase output>
hbase(main):001:0>

The default HBase namespace is named default.

Example: Creating an HBase Table

Create a sample HBase table.

  1. Create an HBase table named order_info in the default namespace. order_info has two column families: product and shipping_info:

    hbase(main):> create 'order_info', 'product', 'shipping_info'
    
  2. The order_info product column family has qualifiers named name and location. The shipping_info column family has qualifiers named state and zipcode. Add some data to the order_info table:

    put 'order_info', '1', 'product:name', 'tennis racquet'
    put 'order_info', '1', 'product:location', 'out of stock'
    put 'order_info', '1', 'shipping_info:state', 'CA'
    put 'order_info', '1', 'shipping_info:zipcode', '12345'
    put 'order_info', '2', 'product:name', 'soccer ball'
    put 'order_info', '2', 'product:location', 'on floor'
    put 'order_info', '2', 'shipping_info:state', 'CO'
    put 'order_info', '2', 'shipping_info:zipcode', '56789'
    put 'order_info', '3', 'product:name', 'snorkel set'
    put 'order_info', '3', 'product:location', 'warehouse'
    put 'order_info', '3', 'shipping_info:state', 'OH'
    put 'order_info', '3', 'shipping_info:zipcode', '34567'
    

    You will access the orders_info HBase table directly via PXF in examples later in this topic.

  3. Display the contents of the order_info table:

    hbase(main):> scan 'order_info'
    ROW     COLUMN+CELL                                               
     1      column=product:location, timestamp=1499074825516, value=out of stock                                                
     1      column=product:name, timestamp=1499074825491, value=tennis racquet                                                  
     1      column=shipping_info:state, timestamp=1499074825531, value=CA                                                       
     1      column=shipping_info:zipcode, timestamp=1499074825548, value=12345                                                  
     2      column=product:location, timestamp=1499074825573, value=on floor    
     ... 
    3 row(s) in 0.0400 seconds                                         
    

Querying External HBase Data

The PXF HBase connector supports a single profile named HBase.

Use the following syntax to create a Greenplum Database external table referencing an HBase table:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<hbase-table-name>?PROFILE=HBase')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');

HBase connector-specific keywords and values used in the CREATE EXTERNAL TABLE call are described below.

Keyword Value
<hbase-table-name> The name of the HBase table.
PROFILE The PROFILE keyword must specify HBase.
FORMAT The FORMAT clause must specify 'CUSTOM' (formatter='pxfwritable_import').

Data Type Mapping

HBase is byte-based; it stores all data types as an array of bytes. To represent HBase data in Greenplum Database, select a data type for your Greenplum Database column that matches the underlying content of the HBase column qualifier values.

Note: PXF does not support complex HBase objects.

Column Mapping

You can create a Greenplum Database external table that references all, or a subset of, the column qualifiers defined in an HBase table. PXF supports direct or indirect mapping between a Greenplum Database table column and an HBase table column qualifier.

Direct Mapping

When you use direct mapping to map Greenplum Database external table column names to HBase qualifiers, you specify column-family-qualified HBase qualifier names as quoted values. The PXF HBase connector passes these column names as-is to HBase as it reads the table data.

For example, to create a Greenplum Database external table accessing the following data:

  • qualifier name in the column family named product
  • qualifier zipcode in the column family named shipping_info 

from the order_info HBase table you created in Example: Creating an HBase Table, use this CREATE EXTERNAL TABLE syntax:

CREATE EXTERNAL TABLE orderinfo_hbase ("product:name" varchar, "shipping_info:zipcode" int)
    LOCATION ('pxf://order_info?PROFILE=HBase')                                     
    FORMAT 'CUSTOM' (formatter='pxfwritable_import');

Indirect Mapping via Lookup Table

When you use indirect mapping to map Greenplum Database external table column names to HBase qualifiers, you specify the mapping in a lookup table you create in HBase. The lookup table maps a <column-family>:<column-qualifier> to a column name alias that you specify when you create the Greenplum Database external table.

You must name the HBase PXF lookup table pxflookup. And you must define this table with a single column family named mapping. For example:

hbase(main):> create 'pxflookup', 'mapping'

While the direct mapping method is fast and intuitive, using indirect mapping allows you to create a shorter, character-based alias for the HBase <column-family>:<column-qualifier> name. This better reconciles HBase column qualifier names with Greenplum Database due to the following:

  • HBase qualifier names can be very long. Greenplum Database has a 63 character limit on the size of the column name.
  • HBase qualifier names can include binary or non-printable characters. Greenplum Database column names are character-based.

When populating the pxflookup HBase table, add rows to the table such that the:

  • row key specifies the HBase table name
  • mapping column family qualifier identifies the Greenplum Database column name, and the value identifies the HBase <column-family>:<column-qualifier> for which you are creating the alias.

For example, to use indirect mapping with the order_info table, add these entries to the pxflookup table:

hbase(main):> put 'pxflookup', 'order_info', 'mapping:pname', 'product:name'
hbase(main):> put 'pxflookup', 'order_info', 'mapping:zip', 'shipping_info:zipcode'

Then create a Greenplum Database external table using the following CREATE EXTERNAL TABLE syntax:

CREATE EXTERNAL TABLE orderinfo_map (pname varchar, zip int)
    LOCATION ('pxf://order_info?PROFILE=HBase')
    FORMAT 'CUSTOM' (formatter='pxfwritable_import');

Row Key

The HBase table row key is a unique identifier for the table row. PXF handles the row key in a special way.

To use the row key in the Greenplum Database external table query, define the external table using the PXF reserved column named recordkey. The recordkey column name instructs PXF to return the HBase table record key for each row.

Define the recordkey using the Greenplum Database data type bytea.

For example:

CREATE EXTERNAL TABLE <table_name> (recordkey bytea, ... ) 
    LOCATION ('pxf://<hbase_table_name>?PROFILE=HBase')
    FORMAT 'CUSTOM' (formatter='pxfwritable_import');

After you have created the external table, you can use the recordkey in a WHERE clause to filter the HBase table on a range of row key values.