gphdfs:// Protocol



The gphdfs:// protocol specifies a path on a Hadoop Distributed File System (HDFS). The path can contain wildcard characters. CSV, TEXT, and custom formats are supported for HDFS files.
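As an illustration of a gphdfs:// location with a wildcard, the following is a minimal sketch of a readable external table. The table name, column list, host name (hdfshost-1), port (8081), and HDFS path are hypothetical placeholders, not values taken from this documentation; substitute the values for your own Hadoop cluster.

```sql
-- Sketch only: table, columns, host, port, and path are assumed placeholders.
CREATE EXTERNAL TABLE ext_expenses (
    name         text,
    expense_date date,
    amount       float4,
    category     text
)
-- The wildcard matches every .txt file in the (hypothetical) directory.
LOCATION ('gphdfs://hdfshost-1:8081/data/expenses/*.txt')
FORMAT 'TEXT' (DELIMITER ',');
```

Querying ext_expenses then reads the matching HDFS files as if they were rows of a regular table.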

When a Greenplum external table references HDFS files, all the data is read in parallel from the HDFS data nodes into the Greenplum segments for rapid processing. Greenplum determines the connections between the segments and the data nodes.

Each Greenplum segment reads one set of Hadoop data blocks. For writing, each Greenplum segment writes only the data it contains. The following figure illustrates an external table located on an HDFS file system.

Figure 1. External Table Located on a Hadoop Distributed File System
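The write path described above can be sketched with a writable external table. This is a sketch under assumptions: the table names, columns, host, port, and output path are hypothetical, and the source table expenses is assumed to already exist with matching columns.

```sql
-- Sketch only: names, host, port, and path are assumed placeholders.
CREATE WRITABLE EXTERNAL TABLE ext_expenses_out (
    name         text,
    expense_date date,
    amount       float4,
    category     text
)
LOCATION ('gphdfs://hdfshost-1:8081/data/expenses_out')
FORMAT 'TEXT' (DELIMITER ',');

-- Each segment writes only the rows it holds.
-- "expenses" is a hypothetical existing table with the same columns.
INSERT INTO ext_expenses_out
SELECT name, expense_date, amount, category
FROM expenses;
```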

The FORMAT clause describes the format of the external table files. Valid file formats are TEXT and CSV, whose formatting options are similar to those of the PostgreSQL COPY command, and user-defined formats for the gphdfs protocol. If the data in the file does not use the default column delimiter, escape character, null string, and so on, you must specify the additional formatting options so that Greenplum Database reads the data in the external file correctly. The gphdfs protocol requires a one-time setup. See One-time HDFS Protocol Installation.
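For instance, if the HDFS files are pipe-delimited CSV and represent missing values with the string N/A, the FORMAT clause has to say so. The following sketch assumes those two non-default settings; the table name, columns, host, port, and path are again hypothetical placeholders.

```sql
-- Sketch only: names, host, port, path, and the chosen options are assumptions.
CREATE EXTERNAL TABLE ext_sales (
    region text,
    amount numeric
)
LOCATION ('gphdfs://hdfshost-1:8081/data/sales/*.csv')
-- Override the default CSV delimiter (comma) and the default null string.
FORMAT 'CSV' (DELIMITER '|' NULL 'N/A');
```

Options that are left out of the FORMAT clause keep their defaults, so only the settings that differ from the defaults need to be listed.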