Using the Greenplum Parallel File Server (gpfdist)

A newer version of this documentation is available. Click here to view the most up-to-date release of the Greenplum 5.x documentation.

Using the Greenplum Parallel File Server (gpfdist)

The gpfdist protocol provides the best performance and is the easiest to set up. gpfdist ensures optimum use of all segments in your Greenplum Database system for external table reads.

This topic describes the setup and management tasks for using gpfdist with external tables.

About gpfdist Setup and Performance

Consider the following scenarios for optimizing your ETL network performance.

  • Allow network traffic to use all ETL host Network Interface Cards (NICs) simultaneously. Run one instance of gpfdist on the ETL host, then declare the host name of each NIC in the LOCATION clause of your external table definition (see Examples for Creating External Tables).
Figure 1. External Table Using Single gpfdist Instance with Multiple NICs

  • Divide external table data equally among multiple gpfdist instances on the ETL host. For example, on an ETL system with two NICs, run two gpfdist instances (one on each NIC) to optimize data load performance and divide the external table data files evenly between the two gpfdists.
Figure 2. External Tables Using Multiple gpfdist Instances with Multiple NICs

Note: Use pipes (|) to separate formatted text when you submit files to gpfdist. Greenplum Database encloses comma-separated text strings in single or double quotes. gpfdist has to remove the quotes to parse the strings. Using pipes to separate formatted text avoids the extra step and improves performance.

Controlling Segment Parallelism

The gp_external_max_segs server configuration parameter controls the number of segment instances that can access a single gpfdist instance simultaneously. 64 is the default. You can set the number of segments such that some segments process external data files and some perform other database processing. Set this parameter in the postgresql.conf file of your master instance.

Installing gpfdist

gpfdist is installed in $GPHOME/bin of your Greenplum Database master host installation. Run gpfdist on a machine other than the Greenplum Database master or standby master, such as on a machine devoted to ETL processing. Running gpfdist on the master or standby master can have a performance impact on query execution. To install gpfdist on your ETL server, get it from the Greenplum Load Tools package and follow its installation instructions.

Starting and Stopping gpfdist

You can start gpfdist in your current directory location or in any directory that you specify. The default port is 8080.

From your current directory, type:

gpfdist &

From a different directory, specify the directory from which to serve files, and optionally, the HTTP port to run on.

To start gpfdist in the background and log output messages and errors to a log file:

$ gpfdist -d /var/load_files -p 8081 -l /home/gpadmin/log &

For multiple gpfdist instances on the same ETL host (see Figure 1), use a different base directory and port for each instance. For example:

$ gpfdist -d /var/load_files1 -p 8081 -l /home/gpadmin/log1 &
$ gpfdist -d /var/load_files2 -p 8082 -l /home/gpadmin/log2 &

To stop gpfdist when it is running in the background:

First find its process id:

$ ps -ef | grep gpfdist

Then kill the process, for example (where 3456 is the process ID in this example):

$ kill 3456

Troubleshooting gpfdist

The segments access gpfdist at runtime. Ensure that the Greenplum segment hosts have network access to gpfdist. gpfdist is a web server: test connectivity by running the following command from each host in the Greenplum array (segments and master):

$ wget http://gpfdist_hostname:port/filename
         

The CREATE EXTERNAL TABLE definition must have the correct host name, port, and file names for gpfdist. Specify file names and paths relative to the directory from which gpfdist serves files (the directory path specified when gpfdist started). See Examples for Creating External Tables.

If you start gpfdist on your system and IPv6 networking is disabled, gpfdist displays this warning message when testing for an IPv6 port.

[WRN gpfdist.c:2050] Creating the socket failed

If the corresponding IPv4 port is available, gpfdist uses that port and the warning for IPv6 port can be ignored. To see information about the ports that gpfdist tests, use the -V option.

For information about IPv6 and IPv4 networking, see your operating system documentation.

When reading or writing data with the gpfdist or gfdists protocol, the gpfdist utility rejects HTTP requests that do not include X-GP-PROTO in the request header. If X-GP-PROTO is not detected in the header request gpfist returns a 400 error in the status line of the HTTP response header: 400 invalid request (no gp-proto).

Greenplum Database includes X-GP-PROTO in the HTTP request header to indicate that the request is from Greenplum Database.