Accessing Hadoop with PXF

PXF is compatible with Cloudera, Hortonworks Data Platform, MapR, and generic Apache Hadoop distributions. PXF is installed with HDFS, Hive, and HBase connectors. You use these connectors to access data stored in these Hadoop distributions in a variety of formats.

Architecture

HDFS is the primary distributed storage mechanism used by Apache Hadoop. When a user or application performs a query on a PXF external table that references an HDFS file, the Greenplum Database master node dispatches the query to all segment hosts. Each segment instance contacts the PXF agent running on its host. When it receives the request from a segment instance, the PXF agent:

  1. Allocates a worker thread to serve the request from a segment.
  2. Invokes the HDFS Java API to request metadata information for the HDFS file from the HDFS NameNode.
  3. Provides the metadata information returned by the HDFS NameNode to the segment instance.

Figure: PXF-to-Hadoop Architecture

A segment instance uses its Greenplum Database gp_segment_id and the file block information described in the metadata to assign itself a specific portion of the query data. The segment instance then sends a request to the PXF agent to read the assigned data. This data may reside on one or more HDFS DataNodes.

The PXF agent invokes the HDFS Java API to read the data and delivers it to the segment instance. The segment instance delivers its portion of the data to the Greenplum Database master node. This communication occurs across segment hosts and segment instances in parallel.

Prerequisites

Before working with Hadoop data using PXF, ensure that:

  • You have configured and initialized PXF on your Greenplum Database segment hosts, and PXF is running on each host. See Configuring PXF for additional information.
  • You have configured the PXF Hadoop Connectors that you plan to use on each Greenplum Database segment host. Refer to Configuring PXF Hadoop Connectors for instructions. If you plan to access JSON-formatted data stored in a Cloudera Hadoop cluster, PXF requires a Cloudera version 5.8 or later Hadoop distribution.
  • If user impersonation is enabled (the default), grant read (and, as appropriate, write) permission on the HDFS files and directories that you will access as external tables in Greenplum Database to each Greenplum Database user/role name that will access them. If user impersonation is not enabled, grant this permission to the gpadmin user instead. A sketch of granting these permissions follows this list.
  • Time is synchronized between the Greenplum Database segment hosts and the external Hadoop systems.
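
For example, you might grant an HDFS user that maps to a Greenplum Database role ownership of, and read access to, the directory you plan to expose as an external table. This is only a sketch; the directory /data/pxf_examples and the user bill are hypothetical values that you would replace with your own:

$ hdfs dfs -chown -R bill:bill /data/pxf_examples
$ hdfs dfs -chmod -R 750 /data/pxf_examples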

HDFS Shell Command Primer

Hadoop includes command-line tools that interact directly with your HDFS file system. These tools support typical file system operations including copying and listing files, changing file permissions, and so forth.

The HDFS file system command syntax is hdfs dfs <options> [<file>]. Invoked with no options, hdfs dfs lists the file system options supported by the tool.

The user invoking the hdfs dfs command must have read privileges on the HDFS data store to list and view directory and file contents, and write permission to create directories and files.
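
For example, to list the contents of the HDFS root directory and view the permission mode, owner, and group of each entry:

$ hdfs dfs -ls /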

The hdfs dfs options used in the PXF Hadoop topics are:

Option | Description
------ | -----------
-cat   | Display file contents.
-mkdir | Create a directory in HDFS.
-put   | Copy a file from the local file system to HDFS.

Examples:

Create a directory in HDFS:

$ hdfs dfs -mkdir -p /data/exampledir

Copy a text file from your local file system to HDFS:

$ hdfs dfs -put /tmp/example.txt /data/exampledir/

Display the contents of a text file located in HDFS:

$ hdfs dfs -cat /data/exampledir/example.txt

Connectors, Data Formats, and Profiles

The PXF Hadoop connectors provide built-in profiles to support the following data formats:

  • Text
  • Avro
  • JSON
  • ORC
  • Parquet
  • RCFile
  • SequenceFile
  • AvroSequenceFile

The PXF Hadoop connectors expose the following profiles to read, and in many cases write, these supported data formats:

Data Source | Data Format                           | Profile Name(s)                   | Deprecated Profile Name
----------- | ------------------------------------- | --------------------------------- | -----------------------
HDFS        | delimited single line text            | hdfs:text                         | HdfsTextSimple
HDFS        | delimited text with quoted linefeeds  | hdfs:text:multi                   | HdfsTextMulti
HDFS        | Avro                                  | hdfs:avro                         | Avro
HDFS        | JSON                                  | hdfs:json                         | Json
HDFS        | Parquet                               | hdfs:parquet                      | Parquet
HDFS        | AvroSequenceFile                      | hdfs:AvroSequenceFile             | n/a
HDFS        | SequenceFile                          | hdfs:SequenceFile                 | SequenceWritable
Hive        | stored as TextFile                    | Hive, HiveText                    | n/a
Hive        | stored as SequenceFile                | Hive                              | n/a
Hive        | stored as RCFile                      | Hive, HiveRC                      | n/a
Hive        | stored as ORC                         | Hive, HiveORC, HiveVectorizedORC  | n/a
Hive        | stored as Parquet                     | Hive                              | n/a
HBase       | Any                                   | HBase                             | n/a

You provide the profile name when you specify the pxf protocol on a CREATE EXTERNAL TABLE command to create a Greenplum Database external table that references a Hadoop file, directory, or table. For example, the following command creates an external table that specifies the profile named hdfs:text:

CREATE EXTERNAL TABLE pxf_hdfs_text(location text, month text, num_orders int, total_sales float8)
  LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text')
  FORMAT 'TEXT' (delimiter=E',');
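
You can then query the external table as you would any other Greenplum Database table. For example, assuming the referenced HDFS file exists and contains comma-delimited rows that match the column definitions:

SELECT * FROM pxf_hdfs_text;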