Loading and Unloading Data

Greenplum supports parallel data loading and unloading for large amounts of data, as well as single file, non-parallel import and export for small amounts of data.

The first topics of this section describe methods for loading and writing data into and out of a Greenplum Database. The last topic describes how to format data files.

Greenplum Database Loading Tools Overview

Greenplum provides the following tools for loading and unloading data.

  • External Tables enable accessing external files as if they are regular database tables.
  • gpload provides an interface to the Greenplum Database parallel loader.
  • COPY is the standard PostgreSQL non-parallel data loading tool.

External Tables

External tables allow you to access external files from within the database as if they are regular database tables. Used with gpfdist, the Greenplum parallel file distribution program, external tables provide full parallelism by using the resources of all Greenplum segments to load or unload data. Greenplum Database leverages the parallel architecture of the Hadoop Distributed File System to access files on that system.

You can query external table data directly and in parallel using SQL commands such as SELECT, and you can join, sort, and create views on external tables.
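
For example, assuming ext_expenses is an external table like the ones defined later in this section and departments is a hypothetical regular table, you can join and summarize the external data directly:

=# SELECT d.dept_name, sum(e.amount) 
   FROM ext_expenses e 
   JOIN departments d ON (e.category = d.category) 
   GROUP BY d.dept_name;
=# CREATE VIEW travel_expenses AS 
   SELECT * FROM ext_expenses WHERE category='travel';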

The steps for using external tables are:

  1. Define the external table.
  2. Do one of the following:
    • Start the Greenplum file server(s) if you plan to use the gpfdist or gpfdists protocols.
    • Verify that you have already set up the required one-time configuration for gphdfs.
  3. Place the data files in the correct locations.
  4. Query the external table with SQL commands.

Greenplum Database provides readable and writable external tables:

  • Readable external tables for data loading. Readable external tables support basic extraction, transformation, and loading (ETL) tasks common in data warehousing. Greenplum Database segment instances read external table data in parallel to optimize large load operations. You cannot modify readable external tables.
  • Writable external tables for data unloading. Writable external tables support:
    • Selecting data from database tables to insert into the writable external table.
    • Sending data to an application as a stream of data. For example, unload data from Greenplum Database and send it to an application that connects to another database or ETL tool to load the data elsewhere.
    • Receiving output from Greenplum parallel MapReduce calculations.

    Writable external tables allow only INSERT operations.

External tables can be file-based or web-based. External tables using the file:// protocol are read-only tables.

  • Regular (file-based) external tables access static flat files. Regular external tables are rescannable: the data is static while the query runs.
  • Web (web-based) external tables access dynamic data sources, either on a web server with the http:// protocol or by executing OS commands or scripts. Web external tables are not rescannable: the data can change while the query runs.

Dump and restore operate only on external and web external table definitions, not on the data sources.

gpload

The gpload data loading utility is the interface to Greenplum's external table parallel loading feature. gpload uses a load specification defined in a YAML formatted control file to load data into the target database as follows:
  1. Invoke the Greenplum parallel file server program (gpfdist).
  2. Create an external table definition based on the source data defined.
  3. Load the source data into the target table in the database according to gpload MODE (Insert, Update or Merge).

COPY

Greenplum Database offers the standard PostgreSQL COPY command for loading and unloading data. COPY is a non-parallel load/unload mechanism: data is loaded/unloaded in a single process using the Greenplum master database instance. For small amounts of data, COPY offers a simple way to move data into or out of a database in a single transaction, without the administrative overhead of setting up an external table.
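
For example, the following commands load a small file into a hypothetical expenses table and then write the table contents back out to a file; the table and file paths are illustrative only:

=# COPY expenses FROM '/data/gpdb/expenses_data.txt' 
   WITH DELIMITER '|';
=# COPY expenses TO '/data/gpdb/expenses_backup.txt' 
   WITH DELIMITER '|';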

Loading Data into Greenplum Database

Accessing File-Based External Tables

To create an external table definition, you specify the format of your input files and the location of your external data sources. For information about input file formats, see Formatting Data Files.

Use one of the following protocols to access external table data sources. You cannot mix protocols in CREATE EXTERNAL TABLE statements.

  • gpfdist: points to a directory on the file host and serves external data files to all Greenplum Database segments in parallel.
  • gpfdists: the secure version of gpfdist.
  • file:// accesses external data files on a segment host that the Greenplum superuser (gpadmin) can access.
  • gphdfs: accesses files on a Hadoop Distributed File System (HDFS).

gpfdist and gpfdists require a one-time setup during table creation.

gpfdist

gpfdist serves external data files from a directory on the file host to all Greenplum Database segments in parallel. gpfdist uncompresses gzip (.gz) and bzip2 (.bz2) files automatically. Run gpfdist on the host on which the external data files reside.

All primary segments access the external file(s) in parallel, subject to the number of segments set in the gp_external_max_segs server configuration parameter. Use multiple gpfdist data sources in a CREATE EXTERNAL TABLE statement to scale the external table's scan performance. For more information about configuring this, see Controlling Segment Parallelism.

You can use the wildcard character (*) or other C-style pattern matching to denote multiple files to get. The files specified are assumed to be relative to the directory that you specified when you started gpfdist.

gpfdist is located in the $GPHOME/bin directory on your Greenplum Database master host and on each segment host. See the gpfdist reference documentation for more information about using gpfdist with external tables.
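
For example, if gpfdist is started with -d /var/load_files, a LOCATION path of sales/*.txt refers to files under /var/load_files/sales/. The host, port, and directory names here are illustrative:

$ gpfdist -d /var/load_files -p 8081 -l /home/gpadmin/log &

=# CREATE EXTERNAL TABLE ext_sales (id int, amount float4)
   LOCATION ('gpfdist://etlhost-1:8081/sales/*.txt')
   FORMAT 'TEXT' (DELIMITER '|');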

gpfdists

The gpfdists protocol is a secure version of gpfdist. gpfdists enables encrypted communication and secure identification of the file server and the Greenplum Database to protect against attacks such as eavesdropping and man-in-the-middle attacks.

gpfdists implements SSL security in a client/server scheme as follows.

  • Client certificates are required.
  • Multilingual certificates are not supported.
  • A Certificate Revocation List (CRL) is not supported.
  • The TLSv1 protocol is used with the TLS_RSA_WITH_AES_128_CBC_SHA encryption algorithm.
  • SSL parameters cannot be changed.
  • SSL renegotiation is supported.
  • The SSL ignore host mismatch parameter is set to false.
  • Private keys containing a passphrase are not supported for the gpfdist file server (server.key) and for the Greenplum Database (client.key).
  • Issuing certificates that are appropriate for the operating system in use is the user's responsibility. Generally, converting certificates as shown in https://www.sslshopper.com/ssl-converter.html is supported.
Note: A server started with the gpfdist --ssl option can only communicate with the gpfdists protocol. A server that was started with gpfdist without the --ssl option can only communicate with the gpfdist protocol.

Use one of the following methods to invoke the gpfdists protocol.

  • Run gpfdist with the --ssl option and then use the gpfdists protocol in the LOCATION clause of a CREATE EXTERNAL TABLE statement.
  • Use a YAML Control File with the SSL option set to true and run gpload. Running gpload starts the gpfdist server with the --ssl option, then uses the gpfdists protocol.
Important: Do not protect the private key with a passphrase. The server does not prompt for a passphrase for the private key, and loading data fails with an error if one is required.
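
For example, assuming the required certificates are already in place (see the list that follows), you can start gpfdist with the --ssl option, pointing to the directory that contains the server certificates, and then reference the gpfdists protocol in the table definition. The host names and paths shown are illustrative:

$ gpfdist --ssl /home/gpadmin/certs -d /var/load_files -p 8081 -l /home/gpadmin/log &

=# CREATE EXTERNAL TABLE ext_expenses ( name text, 
   date date, amount float4, category text, desc1 text )
   LOCATION ('gpfdists://etlhost-1:8081/*.txt')
   FORMAT 'TEXT' (DELIMITER '|');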

gpfdists requires that the following client certificates reside in the $PGDATA/gpfdists directory on each segment.

  • The client certificate file, client.crt
  • The client private key file, client.key
  • The trusted certificate authorities, root.crt

For an example of loading data into an external table securely, see Example 3—Multiple gpfdists instances.

file

The file:// protocol requires that the external data file(s) reside on a segment host in a location accessible by the Greenplum superuser (gpadmin). The number of URIs that you specify corresponds to the number of segment instances that will work in parallel to access the external table. For example, if you have a Greenplum Database system with 8 primary segments and you specify 2 external files, only 2 of the 8 segments will access the external table in parallel at query time. The number of external files per segment host cannot exceed the number of primary segment instances on that host. For example, if your array has 4 primary segment instances per segment host, you can place 4 external files on each segment host. The host name used in the URI must match the segment host name as registered in the gp_segment_configuration system catalog table. Tables based on the file:// protocol can only be readable tables.

The system view pg_max_external_files shows how many external table files are permitted per external table. This view lists the available file slots per segment host when using the file:// protocol. The view is only applicable for the file:// protocol. For example:

SELECT * FROM pg_max_external_files;

gphdfs

This protocol specifies a path, which can contain wildcard characters, on a Hadoop Distributed File System. CSV, TEXT, and custom formats are allowed for HDFS files.

When Greenplum links with HDFS files, all the data is read in parallel from the HDFS data nodes into the Greenplum segments for rapid processing. Greenplum determines the connections between the segments and nodes.

Each Greenplum segment reads one set of Hadoop data blocks. For writing, each Greenplum segment writes only the data contained on it.

Figure 1. External Table Located on a Hadoop Distributed File System

The FORMAT clause describes the format of the external table files. Valid file formats are similar to the formatting options available with the PostgreSQL COPY command and user-defined formats for the gphdfs protocol.

If the data in the file does not use the default column delimiter, escape character, null string and so on, you must specify the additional formatting options so that Greenplum Database reads the data in the external file correctly.

The gphdfs protocol requires a one-time setup. See One-time HDFS Protocol Installation.

Errors in External Table Data

By default, if external table data contains an error, the command fails and no data loads into the target database table. Define the external table with single row error handling to enable loading correctly-formatted rows and to isolate data errors in external table data. See Handling Load Errors.

The gpfdist file server uses the HTTP protocol. External table queries that use LIMIT end the connection after retrieving the rows, causing an HTTP socket error. If you use LIMIT in queries of external tables that use the gpfdist:// or http:// protocols, ignore these errors – data is returned to the database as expected.

Using the Greenplum Parallel File Server (gpfdist)

The gpfdist protocol provides the best performance and is the easiest to set up. gpfdist ensures optimum use of all segments in your Greenplum Database system for external table reads.

This topic describes the setup and management tasks for using gpfdist with external tables.

About gpfdist Setup and Performance

Consider the following scenarios for optimizing your ETL network performance.

  • Allow network traffic to use all ETL host Network Interface Cards (NICs) simultaneously. Run one instance of gpfdist on the ETL host, then declare the host name of each NIC in the LOCATION clause of your external table definition (see Creating External Tables - Examples).
Figure 2. External Table Using Single gpfdist Instance with Multiple NICs

  • Divide external table data equally among multiple gpfdist instances on the ETL host. For example, on an ETL system with two NICs, run two gpfdist instances (one on each NIC) to optimize data load performance and divide the external table data files evenly between the two gpfdists.
Figure 3. External Tables Using Multiple gpfdist Instances with Multiple NICs

Note: Use pipes (|) to separate formatted text when you submit files to gpfdist. Greenplum Database encloses comma-separated text strings in single or double quotes. gpfdist has to remove the quotes to parse the strings. Using pipes to separate formatted text avoids the extra step and improves performance.

Controlling Segment Parallelism

The gp_external_max_segs server configuration parameter controls the number of segment instances that can access a single gpfdist instance simultaneously. 64 is the default. You can set the number of segments such that some segments process external data files and some perform other database processing. Set this parameter in the postgresql.conf file of your master instance.
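
For example, to limit each gpfdist instance to 32 concurrent segment connections, you can set the parameter with the gpconfig utility and reload the configuration; the value shown is illustrative:

$ gpconfig -c gp_external_max_segs -v 32
$ gpstop -u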

Installing gpfdist

gpfdist is installed in $GPHOME/bin of your Greenplum Database master host installation. Run gpfdist from a machine other than the Greenplum Database master, such as on a machine devoted to ETL processing. If you want to install gpfdist on your ETL server, get it from the Greenplum Load Tools package and follow its installation instructions.

Starting and Stopping gpfdist

You can start gpfdist in your current directory location or in any directory that you specify. The default port is 8080.

From your current directory, type:

gpfdist &

From a different directory, specify the directory from which to serve files, and optionally, the HTTP port to run on.

To start gpfdist in the background and log output messages and errors to a log file:

$ gpfdist -d /var/load_files -p 8081 -l /home/gpadmin/log &

For multiple gpfdist instances on the same ETL host (see Figure 3), use a different base directory and port for each instance. For example:

$ gpfdist -d /var/load_files1 -p 8081 -l /home/gpadmin/log1 &
$ gpfdist -d /var/load_files2 -p 8082 -l /home/gpadmin/log2 &

To stop gpfdist when it is running in the background:

First find its process id:

$ ps -ef | grep gpfdist

Then kill the process. For example, if the process ID is 3456:

$ kill 3456

Troubleshooting gpfdist

The segments access gpfdist at runtime. Ensure that the Greenplum segment hosts have network access to gpfdist. gpfdist is a web server: test connectivity by running the following command from each host in the Greenplum array (segments and master):

$ wget http://gpfdist_hostname:port/filename

The CREATE EXTERNAL TABLE definition must have the correct host name, port, and file names for gpfdist. Specify file names and paths relative to the directory from which gpfdist serves files (the directory path specified when gpfdist started). See Creating External Tables - Examples.

If you start gpfdist on your system and IPv6 networking is disabled, gpfdist displays this warning message when testing for an IPv6 port.

[WRN gpfdist.c:2050] Creating the socket failed

If the corresponding IPv4 port is available, gpfdist uses that port and the warning for IPv6 port can be ignored. To see information about the ports that gpfdist tests, use the -V option.

For information about IPv6 and IPv4 networking, see your operating system documentation.

Using Hadoop Distributed File System (HDFS) Tables

Greenplum Database leverages the parallel architecture of a Hadoop Distributed File System to read and write data files efficiently with the gphdfs protocol. There are three steps to using HDFS, described in the following sections: perform the one-time gphdfs protocol installation, grant privileges for the HDFS protocol, and specify HDFS data in an external table definition.

One-time HDFS Protocol Installation

Install and configure Hadoop for use with gphdfs as follows:
  1. Install Java 1.6 or later on all Greenplum Database hosts: master, segment, and standby master.
  2. Install a supported Hadoop distribution on all hosts. The distribution must be the same on all hosts. For Hadoop installation information, see the Hadoop distribution documentation.
    Greenplum Database supports the following Hadoop distributions:
    Table 1. Hadoop Distributions
    Hadoop Distribution          Version                       gp_hadoop_target_version
    Pivotal HD                   Pivotal HD 2.0,               gphd-2.0
                                 Pivotal HD 1.0 (see note 1)
    Greenplum HD                 Greenplum HD 1.2              gphd-1.2
                                 Greenplum HD 1.1              gphd-1.1 (default)
    Cloudera                     CDH 5.0, 5.1                  cdh4.1
                                 CDH 4.1 (see note 2) -        cdh3u2
                                 CDH 4.7
    Hortonworks Data Platform    HDP 2.1                       hdp2
    Greenplum MR (see note 3)    Greenplum MR 1.2              gpmr-1.2
                                 Greenplum MR 1.0              gpmr-1.0

    Notes:

    1. Pivotal HD 1.0 is a distribution of Hadoop 2.0.

    2. For CDH 4.1, only CDH4 with MRv1 is supported.

    3. For the Greenplum MR targets (MR 1.0 or 1.2), install the MapR client program. For information about setting up the MapR client, see the MapR documentation: http://doc.mapr.com/display/MapR/Home

    For the latest information regarding supported Hadoop distributions, see the Greenplum Database Release Notes for your release.

  3. After installation, ensure that the Greenplum system user (gpadmin) has read and execute access to the Hadoop libraries or to the Greenplum MR client.
  4. Set the following environment variables on all segments:
    • JAVA_HOME – the Java home directory
    • HADOOP_HOME – the Hadoop home directory

    For example, add lines such as the following to the gpadmin user .bashrc profile.

    export JAVA_HOME=/usr/java/default
    export HADOOP_HOME=/usr/lib/gphd
    

    The variables must be set in the ~gpadmin/.bashrc or the ~gpadmin/.bash_profile file so that the gpadmin user shell environment can locate the Java home and Hadoop home.

  5. Set the following Greenplum Database server configuration parameters and restart Greenplum Database.
    Table 2. Server Configuration Parameters for Hadoop Targets
    Configuration Parameter     Description                                      Default Value    Set Classifications
    gp_hadoop_target_version    The Hadoop target. Choose one of the             gphd-1.1         master, session, reload
                                following: gphd-1.0, gphd-1.1, gphd-1.2,
                                gphd-2.0, gpmr-1.0, gpmr-1.2, hdp2, cdh3u2,
                                cdh4.1.
    gp_hadoop_home              When using Pivotal HD, specify the               NULL             master, session, reload
                                installation directory for Hadoop. For
                                example, the default installation directory
                                is /usr/lib/gphd. When using Greenplum HD
                                1.2 or earlier, specify the same value as
                                the HADOOP_HOME environment variable.

    For example, the following commands use the Greenplum Database utilities gpconfig and gpstop to set the server configuration parameters and restart Greenplum Database:

    gpconfig -c gp_hadoop_target_version -v "'gphd-2.0'"
    gpconfig -c gp_hadoop_home -v "'/usr/lib/gphd'"
    gpstop -u
    

    For information about the Greenplum Database utilities gpconfig and gpstop, see the Greenplum Database Utility Guide.

Grant Privileges for the HDFS Protocol

To enable privileges required to create external tables that access files on HDFS:

  1. Grant the following privileges on gphdfs to the owner of the external table.
    • Grant SELECT privileges to enable creating readable external tables on HDFS.
    • Grant INSERT privileges to enable creating writable external tables on HDFS.

      Use the GRANT command to grant read privileges (SELECT) and, if needed, write privileges (INSERT) on HDFS to the Greenplum system user (gpadmin).

      GRANT INSERT ON PROTOCOL gphdfs TO gpadmin;
  2. Greenplum Database uses Greenplum OS credentials to connect to HDFS. Grant read privileges and, if needed, write privileges to HDFS to the Greenplum administrative user (gpadmin OS user).
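
For example, the grants for step 1 for a table owner that creates both readable and writable gphdfs external tables might look like this:

GRANT SELECT ON PROTOCOL gphdfs TO gpadmin;
GRANT INSERT ON PROTOCOL gphdfs TO gpadmin;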

Specify HDFS Data in an External Table Definition

The LOCATION clause of the CREATE EXTERNAL TABLE command for Hadoop files has the following format:

LOCATION ('gphdfs://hdfs_host[:port]/path/filename.txt')

For any connector except gpmr-1.0-gnet-1.0.0.1, specify a name node port. Do not specify a port with the gpmr-1.0-gnet-1.0.0.1 connector.

Restrictions for HDFS files are as follows.
  • You can specify one path for a readable external table with gphdfs. Wildcard characters are allowed. If you specify a directory, the default is all files in the directory.

    You can specify only a directory for writable external tables.

  • Format restrictions are as follows.
    • Only TEXT format is allowed for readable and writable external tables.
    • Only the gphdfs_import formatter is allowed for readable external tables with a custom format.
    • Only the gphdfs_export formatter is allowed for writable external tables with a custom format.
    • You can set compression only for writable external tables. Compression settings are automatic for readable external tables.

Setting Compression Options for Hadoop Writable External Tables

Compression options for Hadoop Writable External Tables use the form of a URI query and begin with a question mark. Specify multiple compression options with an ampersand (&).

Table 3. Compression Options

Compression Option    Values                Default Value
compress              true or false         false
compression_type      BLOCK or RECORD       RECORD
codec                 Codec class name      GzipCodec for text format and DefaultCodec for gphdfs_export format

Place compression options in the query portion of the URI.

HDFS Readable and Writable External Table Examples

The following code defines a readable external table for an HDFS file named filename.txt on port 8081.

=# CREATE EXTERNAL TABLE ext_expenses ( 
        name text, 
        date date, 
        amount float4, 
        category text, 
        desc1 text ) 
   LOCATION ('gphdfs://hdfshost-1:8081/data/filename.txt') 
   FORMAT 'TEXT' (DELIMITER ',');
Note: Omit the port number when using the gpmr-1.0-gnet-1.0.0.1 connector.

The following code defines a set of readable external tables that have a custom format located in the same HDFS directory on port 8081.

=# CREATE EXTERNAL TABLE ext_expenses 
   LOCATION ('gphdfs://hdfshost-1:8081/data/custdat*.dat') 
   FORMAT 'custom' (formatter='gphdfs_import');
Note: Omit the port number when using the gpmr-1.0-gnet-1.0.0.1 connector.

The following code defines an HDFS directory for a writable external table on port 8081 with all compression options specified.

=# CREATE WRITABLE EXTERNAL TABLE ext_expenses 
   LOCATION ('gphdfs://hdfshost-1:8081/data/?compress=true&compression_type=RECORD
   &codec=org.apache.hadoop.io.compress.DefaultCodec') 
   FORMAT 'custom' (formatter='gphdfs_export');
Note: Omit the port number when using the gpmr-1.0-gnet-1.0.0.1 connector.

Because the previous code uses the default compression options for compression_type and codec, the following command is equivalent.

=# CREATE WRITABLE EXTERNAL TABLE ext_expenses 
   LOCATION    ('gphdfs://hdfshost-1:8081/data?compress=true')
   FORMAT 'custom' (formatter='gphdfs_export');
Note: Omit the port number when using the gpmr-1.0-gnet-1.0.0.1 connector.

Reading and Writing Custom-Formatted HDFS Data

Use MapReduce and the CREATE EXTERNAL TABLE command to read and write data with custom formats on HDFS.

To read custom-formatted data:

  1. Author and run a MapReduce job that creates a copy of the data in a format accessible to Greenplum Database.
  2. Use CREATE EXTERNAL TABLE to read the data into Greenplum Database.

See Example 1 - Read Custom-Formatted Data from HDFS.

To write custom-formatted data:

  1. Write the data.
  2. Author and run a MapReduce program to convert the data to the custom format and place it on the Hadoop Distributed File System.

See Example 2 - Write Custom-Formatted Data from Greenplum Database to HDFS.

MapReduce code is written in Java. Greenplum provides Java APIs for use in the MapReduce code. The Javadoc is available in the $GPHOME/docs directory. To view the Javadoc, expand the file gnet-1.1-javadoc.tar and open index.html. The Javadoc documents the following packages:

com.emc.greenplum.gpdb.hadoop.io
com.emc.greenplum.gpdb.hadoop.mapred
com.emc.greenplum.gpdb.hadoop.mapreduce.lib.input
com.emc.greenplum.gpdb.hadoop.mapreduce.lib.output

The HDFS cross-connect packages contain the Java library, which contains the classes GPDBWritable, GPDBInputFormat, and GPDBOutputFormat. The Java packages are available in $GPHOME/lib/hadoop. Compile and run the MapReduce job with the cross-connect package. For example, compile and run the MapReduce job with gphd-1.0-gnet-1.0.0.1.jar if you use the Greenplum HD 1.0 distribution of Hadoop.

To make the Java library available to all Hadoop users, the Hadoop cluster administrator should place the corresponding gphdfs connector jar in the $HADOOP_HOME/lib directory and restart the job tracker. If this is not done, a Hadoop user can still use the gphdfs connector jar, but only by shipping it with each job through Hadoop's distributed cache.
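
For example, if the connector jar is not installed in $HADOOP_HOME/lib, a job driver that parses Hadoop's generic options can ship it with the job through the distributed cache by using the -libjars option. The jar name below matches the Greenplum HD 1.0 example above; adjust it for your distribution:

$ hadoop jar demoMR.jar demoMR \
      -libjars $GPHOME/lib/hadoop/gphd-1.0-gnet-1.0.0.1.jar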

Example 1 - Read Custom-Formatted Data from HDFS

The sample code makes the following assumptions.

  • The data is contained in HDFS directory /demo/data/temp and the name node is running on port 8081.
  • This code writes the data in Greenplum Database format to /demo/data/MRTest1 on HDFS.
  • The data contains the following columns, in order.
    1. A long integer
    2. A Boolean
    3. A text string

Sample MapReduce Code

import com.emc.greenplum.gpdb.hadoop.io.GPDBWritable;
import com.emc.greenplum.gpdb.hadoop.mapreduce.lib.input.GPDBInputFormat;
import com.emc.greenplum.gpdb.hadoop.mapreduce.lib.output.GPDBOutputFormat;
import java.io.*;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.util.*;

public class demoMR {

/*
 * Helper routine to create our generic record. This section shows the
 * format of the data. Modify as necessary. 
 */
 public static GPDBWritable generateGenericRecord() throws
      IOException {
 int[] colType = new int[3];
 colType[0] = GPDBWritable.BIGINT;
 colType[1] = GPDBWritable.BOOLEAN;
 colType[2] = GPDBWritable.VARCHAR;
 
  /*
   * This section passes the values of the data. Modify as necessary. 
   */ 
  GPDBWritable gw = new GPDBWritable(colType); 
  gw.setLong (0, (long)12345);  
  gw.setBoolean(1, true); 
  gw.setString (2, "abcdef");
  return gw; 
} 

/* 
 * DEMO Map/Reduce class test1
 * -- Regardless of the input, this section dumps the generic record
 * into GPDBFormat.
 */
 public static class Map_test1 
     extends Mapper<LongWritable, Text, LongWritable, GPDBWritable> {
 
  private LongWritable word = new LongWritable(1);

  public void map(LongWritable key, Text value, Context context) throws
       IOException { 
    try {
      GPDBWritable gw = generateGenericRecord();
      context.write(word, gw); 
      } 
      catch (Exception e) { 
        throw new IOException (e.getMessage()); 
      } 
    }
  }

  /*
   * Configure and run the test1 job. The job reads text input from
   * /demo/data/tmp and writes GPDBWritable records to /demo/data/MRTest1.
   */
  public static void runTest1() throws Exception {
    Configuration conf = new Configuration(true);
    Job job = new Job(conf, "test1");
    job.setJarByClass(demoMR.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(GPDBWritable.class);
    job.setOutputFormatClass(GPDBOutputFormat.class);
    job.setMapperClass(Map_test1.class);
    FileInputFormat.setInputPaths(job, new Path("/demo/data/tmp"));
    GPDBOutputFormat.setOutputPath(job, new Path("/demo/data/MRTest1"));
    job.waitForCompletion(true);
  }
}
Run CREATE EXTERNAL TABLE

The Hadoop location corresponds to the output path in the MapReduce job.

=# CREATE EXTERNAL TABLE demodata 
   LOCATION ('gphdfs://hdfshost-1:8081/demo/data/MRTest1') 
   FORMAT 'custom' (formatter='gphdfs_import');
Note:

Omit the port number when using the gpmr-1.0-gnet-1.0.0.1 connector.

Example 2 - Write Custom-Formatted Data from Greenplum Database to HDFS

The sample code makes the following assumptions.

  • The data in Greenplum Database format is located in the HDFS directory /demo/data/writeFromGPDB_42, with the name node running on port 8081.
  • This code writes the data to /demo/data/MRTest2 on port 8081.
  1. Run a SQL command to create the writable table.
    =# CREATE WRITABLE EXTERNAL TABLE demodata 
       LOCATION ('gphdfs://hdfshost-1:8081/demo/data/MRTest2') 
       FORMAT 'custom' (formatter='gphdfs_export');
    
  2. Author and run code for a MapReduce job. Use the same import statements shown in Example 1 - Read Custom-Formatted Data from HDFS.
Note: Omit the port number when using the gpmr-1.0-gnet-1.0.0.1 connector.

MapReduce Sample Code

/*
 * DEMO Map/Reduce class test2
 * -- Convert GPDBFormat back to TEXT
 */
public static class Map_test2 extends Mapper<LongWritable, GPDBWritable,
  Text, NullWritable> { 
  public void map(LongWritable key, GPDBWritable value, Context context )
    throws IOException {
    try {
      context.write(new Text(value.toString()), NullWritable.get());
    } catch (Exception e) { throw new IOException (e.getMessage()); }
  }
}

public static void runTest2() throws Exception {
  // Configure and run the test2 job. The job reads GPDBWritable records
  // from /demo/data/writeFromGPDB_42 and writes them back out as text.
  Configuration conf = new Configuration(true);
  Job job = new Job(conf, "test2");
  job.setJarByClass(demoMR.class);
  job.setInputFormatClass(GPDBInputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(NullWritable.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  job.setMapperClass(Map_test2.class);
  GPDBInputFormat.setInputPaths(job,
      new Path("/demo/data/writeFromGPDB_42"));
  GPDBOutputFormat.setOutputPath(job, new Path("/demo/data/MRTest2"));
  job.waitForCompletion(true);
}

Creating and Using Web External Tables

CREATE EXTERNAL WEB TABLE creates a web table definition. Web external tables allow Greenplum Database to treat dynamic data sources like regular database tables. Because web table data can change as a query runs, the data is not rescannable.

You can define command-based or URL-based web external tables. The definition forms are distinct: you cannot mix command-based and URL-based definitions.

Command-based Web External Tables

The output of a shell command or script defines command-based web table data. Specify the command in the EXECUTE clause of CREATE EXTERNAL WEB TABLE. The data is current as of the time the command runs. The EXECUTE clause runs the shell command or script on the specified master, and/or segment host or hosts. The command or script must reside on the hosts corresponding to the host(s) defined in the EXECUTE clause.

By default, the command is run on segment hosts when active segments have output rows to process. For example, if each segment host runs four primary segment instances that have output rows to process, the command runs four times per segment host. You can optionally limit the number of segment instances that execute the web table command. All segments included in the web table definition in the ON clause run the command in parallel.
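
For example, the ON clause can run the command on the master only, or on a fixed number of segment instances; the script paths below are illustrative:

=# CREATE EXTERNAL WEB TABLE master_status (output text)
    EXECUTE '/var/load_scripts/get_master_status.sh' ON MASTER
    FORMAT 'TEXT';

=# CREATE EXTERNAL WEB TABLE sampled_output (output text)
    EXECUTE '/var/load_scripts/get_data.sh' ON 5
    FORMAT 'TEXT';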

The command that you specify in the external table definition executes from the database and cannot access environment variables from .bashrc or .profile. Set environment variables in the EXECUTE clause. For example:

=# CREATE EXTERNAL WEB TABLE output (output text)
    EXECUTE 'PATH=/home/gpadmin/programs; export PATH; myprogram.sh' 
    FORMAT 'TEXT';

Scripts must be executable by the gpadmin user and reside in the same location on the master or segment hosts.

The following command defines a web table that runs a script. The script runs on each segment host where a segment has output rows to process.

=# CREATE EXTERNAL WEB TABLE log_output 
    (linenum int, message text) 
    EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST 
    FORMAT 'TEXT' (DELIMITER '|');

URL-based Web External Tables

A URL-based web table accesses data from a web server using the HTTP protocol. Web table data is dynamic; the data is not rescannable.

Specify the LOCATION of files on a web server using http://. The web data file(s) must reside on a web server that Greenplum segment hosts can access. The number of URLs specified corresponds to the number of segment instances that work in parallel to access the web table. For example, if you specify two external files to a Greenplum Database system with eight primary segments, two of the eight segments access the web table in parallel at query runtime.

The following sample command defines a web table that gets data from several URLs.

=# CREATE EXTERNAL WEB TABLE ext_expenses (name text, 
  date date, amount float4, category text, description text) 
  LOCATION ( 

  'http://intranet.company.com/expenses/sales/file.csv',
  'http://intranet.company.com/expenses/exec/file.csv',
  'http://intranet.company.com/expenses/finance/file.csv',
  'http://intranet.company.com/expenses/ops/file.csv',
  'http://intranet.company.com/expenses/marketing/file.csv',
  'http://intranet.company.com/expenses/eng/file.csv' 

   )
  FORMAT 'CSV' ( HEADER );

Loading Data Using an External Table

Use SQL commands such as INSERT and SELECT to query a readable external table, the same way that you query a regular database table. For example, to load travel expense data from an external table, ext_expenses, into a database table, expenses_travel:

=# INSERT INTO expenses_travel 
    SELECT * from ext_expenses where category='travel';

To load all data into a new database table:

=# CREATE TABLE expenses AS SELECT * from ext_expenses;

Loading and Writing Non-HDFS Custom Data

Greenplum supports TEXT and CSV formats for importing and exporting data. You can load and write the data in other formats by defining and using a custom format or custom protocol.

For information about importing custom data from HDFS, see Reading and Writing Custom-Formatted HDFS Data.

Using a Custom Format

You specify a custom data format in the FORMAT clause of CREATE EXTERNAL TABLE.

FORMAT 'CUSTOM' (formatter=format_function, key1=val1,...keyn=valn)

Where the 'CUSTOM' keyword indicates that the data has a custom format and formatter specifies the function to use to format the data, followed by comma-separated parameters to the formatter function.

Greenplum Database provides functions for formatting fixed-width data, but you must author the formatter functions for variable-width data. The steps are as follows; a sketch illustrating steps 2 and 3 follows the list.

  1. Author and compile input and output functions as a shared library.
  2. Specify the shared library function with CREATE FUNCTION in Greenplum Database.
  3. Use the formatter parameter of CREATE EXTERNAL TABLE's FORMAT clause to call the function.
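
For example, steps 2 and 3 might look like the following sketch, where my_formatter.so and the my_format_in function are hypothetical names for your compiled formatter:

CREATE FUNCTION my_format_in() RETURNS record
AS '$libdir/my_formatter.so', 'my_format_in'
LANGUAGE C STABLE;

CREATE READABLE EXTERNAL TABLE ext_custom (id int, payload text)
LOCATION ('gpfdist://etlhost-1:8081/custom/*.dat')
FORMAT 'CUSTOM' (formatter=my_format_in);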

Importing and Exporting Fixed Width Data

Specify custom formats for fixed-width data with the Greenplum Database functions fixedwidth_in and fixedwidth_out. These functions already exist in the file $GPHOME/share/postgresql/cdb_external_extensions.sql. The following example declares a custom format, then calls the fixedwidth_in function to format the data.

CREATE READABLE EXTERNAL TABLE students (
name varchar(20), address varchar(30), age int)
LOCATION ('file://<host>/file/path/')
FORMAT 'CUSTOM' (formatter=fixedwidth_in, 
         name='20', address='30', age='4');

The following options specify how to import fixed width data.

  • Read all the data.

    To load all the fields on a line of fixed-width data, you must load them in their physical order. You must specify the field length, but you cannot specify starting and ending positions. The field names in the fixed-width arguments must match the order of the fields in the column list of the CREATE EXTERNAL TABLE command.

  • Set options for blank and null characters.

    Trailing blanks are trimmed by default. To keep trailing blanks, use the preserve_blanks=on option. You can reset the trailing blanks option to the default with the preserve_blanks=off option.

    Use the null='null_string_value' option to specify a value for null characters.

  • If you specify preserve_blanks=on, you must also define a value for null characters.
  • If you specify preserve_blanks=off, do not define null, and a field contains only blanks, Greenplum writes a null to the table. If null is defined and the field contains only blanks, Greenplum writes an empty string to the table.

    Use the line_delim='line_ending' parameter to specify the line ending character. The following examples cover most cases. The E specifies an escape string constant.

    line_delim=E'\n'
    line_delim=E'\r'
    line_delim=E'\r\n'
    line_delim='abc'

Examples: Read Fixed-Width Data

The following examples show how to read fixed-width data.

Example 1 – Loading a table with all fields defined
CREATE READABLE EXTERNAL TABLE students (
name varchar(20), address varchar(30), age int)
LOCATION ('file://<host>/file/path/')
FORMAT 'CUSTOM' (formatter=fixedwidth_in, 
         name=20, address=30, age=4);
Example 2 – Loading a table with PRESERVED_BLANKS ON
CREATE READABLE EXTERNAL TABLE students (
name varchar(20), address varchar(30), age int)
LOCATION ('gpfdist://<host>:<portNum>/file/path/')
FORMAT 'CUSTOM' (formatter=fixedwidth_in, 
         name=20, address=30, age=4,
        preserve_blanks='on',null='NULL');
Example 3 – Loading data with no line delimiter
CREATE READABLE EXTERNAL TABLE students (
name varchar(20), address varchar(30), age int)
LOCATION ('file://<host>/file/path/')
FORMAT 'CUSTOM' (formatter=fixedwidth_in, 
         name='20', address='30', age='4', line_delim='?@');
Example 4 – Create a writable external table with a \r\n line delimiter
CREATE WRITABLE EXTERNAL TABLE students_out (
name varchar(20), address varchar(30), age int)
LOCATION ('gpfdist://<host>:<portNum>/file/path/')     
FORMAT 'CUSTOM' (formatter=fixedwidth_out, 
        name=20, address=30, age=4, line_delim=E'\r\n');

Using a Custom Protocol

Greenplum provides protocols such as gpfdist, http, and file for accessing data over a network, or you can author a custom protocol. You can use the standard data formats, TEXT and CSV, or a custom data format with custom protocols.

You can create a custom protocol whenever the available built-in protocols do not suffice for a particular need, for example, if you need to connect Greenplum Database in parallel to another system directly and stream data from one to the other without materializing the data on disk or using an intermediate process such as gpfdist.

  1. Author the send, receive, and (optionally) validator functions in C, with a predefined API. These functions are compiled and registered with the Greenplum Database. For an example custom protocol, see Example Custom Data Access Protocol.
  2. After writing and compiling the read and write functions into a shared object (.so), declare a database function that points to the .so file and function names.

    The following examples use the compiled import and export code.

    CREATE FUNCTION myread() RETURNS integer
    as '$libdir/gpextprotocol.so', 'myprot_import'
    LANGUAGE C STABLE;
    CREATE FUNCTION mywrite() RETURNS integer
    as '$libdir/gpextprotocol.so', 'myprot_export'
    LANGUAGE C STABLE;
    

    The format of the optional function is:

    CREATE OR REPLACE FUNCTION myvalidate() RETURNS void 
    AS '$libdir/gpextprotocol.so', 'myprot_validate' 
    LANGUAGE C STABLE; 
    
  3. Create a protocol that accesses these functions. The validatorfunc is optional.
    CREATE TRUSTED PROTOCOL myprot(
    writefunc='mywrite',
    readfunc='myread', 
    validatorfunc='myvalidate');
  4. Grant access to any other users, as necessary.
    GRANT ALL ON PROTOCOL myprot TO otheruser
    
  5. Use the protocol in readable or writable external tables.
    CREATE WRITABLE EXTERNAL TABLE ext_sales(LIKE sales)
    LOCATION ('myprot://<meta>/<meta>/…')
    FORMAT 'TEXT';
    CREATE READABLE EXTERNAL TABLE ext_sales(LIKE sales)
    LOCATION('myprot://<meta>/<meta>/…')
    FORMAT 'TEXT';
    

Declare custom protocols with the SQL command CREATE TRUSTED PROTOCOL, then use the GRANT command to grant access to your users. For example:

  • Allow a user to create a readable external table with a trusted protocol
    GRANT SELECT ON PROTOCOL <protocol name> TO <user name>
    
  • Allow a user to create a writable external table with a trusted protocol
    GRANT INSERT ON PROTOCOL <protocol name> TO <user name>
    
  • Allow a user to create readable and writable external tables with a trusted protocol
    GRANT ALL ON PROTOCOL <protocol name> TO <user name>
    

Creating External Tables - Examples

The following examples show how to define external data with different protocols. Each CREATE EXTERNAL TABLE command can contain only one protocol.

Note: When using IPv6, always enclose the numeric IP addresses in square brackets.

Start gpfdist before you create external tables with the gpfdist protocol. The following code starts the gpfdist file server program in the background on port 8081 serving files from directory /var/data/staging. The logs are saved in /home/gpadmin/log.

gpfdist -p 8081 -d /var/data/staging -l /home/gpadmin/log &

The CREATE EXTERNAL TABLE SQL command defines external tables, the location and format of the data to load, and the protocol to use to load the data, but does not load data into the table. For example, the following command creates an external table, ext_expenses, from pipe ('|') delimited text data located on etlhost-1:8081 and etlhost-2:8081. See the Greenplum Database Reference Guide for information about CREATE EXTERNAL TABLE.

Example 1— Single gpfdist instance on single-NIC machine

Creates a readable external table, ext_expenses, using the gpfdist protocol. The files are formatted with a pipe ( | ) as the column delimiter.

=# CREATE EXTERNAL TABLE ext_expenses ( name text, 
   date date, amount float4, category text, desc1 text ) 
   LOCATION ('gpfdist://etlhost-1:8081/*', 
             'gpfdist://etlhost-1:8082/*')
   FORMAT 'TEXT' (DELIMITER '|');

Example 2—Multiple gpfdist instances

Creates a readable external table, ext_expenses, using the gpfdist protocol from all files with the txt extension. The column delimiter is a pipe ( | ) and NULL (' ') is a space.

=# CREATE EXTERNAL TABLE ext_expenses ( name text, 
   date date,  amount float4, category text, desc1 text ) 
   LOCATION ('gpfdist://etlhost-1:8081/*.txt', 
             'gpfdist://etlhost-2:8081/*.txt')
   FORMAT 'TEXT' ( DELIMITER '|' NULL ' ') ;

Example 3—Multiple gpfdists instances

Creates a readable external table, ext_expenses, from all files with the txt extension using the gpfdists protocol. The column delimiter is a pipe ( | ) and NULL (' ') is a space. For information about the location of security certificates, see gpfdists.

  1. Run gpfdist with the --ssl option.
  2. Run the following command.
    =# CREATE EXTERNAL TABLE ext_expenses ( name text, 
       date date,  amount float4, category text, desc1 text ) 
       LOCATION ('gpfdists://etlhost-1:8081/*.txt', 
                 'gpfdists://etlhost-2:8082/*.txt')
       FORMAT 'TEXT' ( DELIMITER '|' NULL ' ') ;
    

Example 4—Single gpfdist instance with error logging

Uses the gpfdist protocol to create a readable external table, ext_expenses, from all files with the txt extension. The column delimiter is a pipe ( | ) and NULL (' ') is a space.

Access to the external table is single row error isolation mode. Input data formatting errors are written to the error table, err_customer, with a description of the error. Query err_customer to see the errors, then fix the issues and reload the rejected data. If the error count on a segment is greater than five (the SEGMENT REJECT LIMIT value), the entire external table operation fails and no rows are processed.

=# CREATE EXTERNAL TABLE ext_expenses ( name text, 
   date date, amount float4, category text, desc1 text ) 
   LOCATION ('gpfdist://etlhost-1:8081/*.txt', 
             'gpfdist://etlhost-2:8082/*.txt')
   FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
   LOG ERRORS INTO err_customer SEGMENT REJECT LIMIT 5;

To create the readable ext_expenses table from CSV-formatted text files:

=# CREATE EXTERNAL TABLE ext_expenses ( name text, 
   date date,  amount float4, category text, desc1 text ) 
   LOCATION ('gpfdist://etlhost-1:8081/*.txt', 
             'gpfdist://etlhost-2:8082/*.txt')
   FORMAT 'CSV' ( DELIMITER ',' )
   LOG ERRORS INTO err_customer SEGMENT REJECT LIMIT 5;

Example 5—TEXT Format on a Hadoop Distributed File Server

Creates a readable external table, ext_expenses, using the gphdfs protocol. The column delimiter is a pipe ( | ).

=# CREATE EXTERNAL TABLE ext_expenses ( name text, 
   date date,  amount float4, category text, desc1 text ) 
   LOCATION ('gphdfs://hdfshost-1:8081/data/filename.txt') 
   FORMAT 'TEXT' (DELIMITER '|');

gphdfs requires only one data path.

For examples of reading and writing custom formatted data on a Hadoop Distributed File System, see Reading and Writing Custom-Formatted HDFS Data.

Example 6—Multiple files in CSV format with header rows

Creates a readable external table, ext_expenses, using the file protocol. The files are CSV format and have a header row.

=# CREATE EXTERNAL TABLE ext_expenses ( name text, 
   date date,  amount float4, category text, desc1 text ) 
   LOCATION ('file://filehost:5432/data/international/*', 
              'file://filehost:5432/data/regional/*',
             'file://filehost:5432/data/supplement/*.csv')
   FORMAT 'CSV' (HEADER);

Example 7—Readable Web External Table with Script

Creates a readable web external table that executes a script once per segment host:

=# CREATE EXTERNAL WEB TABLE log_output (linenum int, 
    message text) 
   EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST 
   FORMAT 'TEXT' (DELIMITER '|');

Example 8—Writable External Table with gpfdist

Creates a writable external table, sales_out, that uses gpfdist to write output data to the file sales.out. The column delimiter is a pipe ( | ) and NULL (' ') is a space. The file will be created in the directory specified when you started the gpfdist file server.

=# CREATE WRITABLE EXTERNAL TABLE sales_out (LIKE sales) 
   LOCATION ('gpfdist://etl1:8081/sales.out')
   FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
   DISTRIBUTED BY (txn_id);

Example 9—Writable External Web Table with Script

Creates a writable external web table, campaign_out, that pipes output data received by the segments to an executable script, to_adreport_etl.sh:

=# CREATE WRITABLE EXTERNAL WEB TABLE campaign_out 
   (LIKE campaign) 
   EXECUTE '/var/unload_scripts/to_adreport_etl.sh'
   FORMAT 'TEXT' (DELIMITER '|');

Example 10—Readable and Writable External Tables with XML Transformations

Greenplum Database can read and write XML data to and from external tables with gpfdist. For information about setting up an XML transform, see Transforming XML Data.

Handling Load Errors

Readable external tables are most commonly used to select data to load into regular database tables. You use the CREATE TABLE AS SELECT or INSERT INTO commands to query the external table data. By default, if the data contains an error, the entire command fails and the data is not loaded into the target database table.

The SEGMENT REJECT LIMIT clause allows you to isolate format errors in external table data and to continue loading correctly formatted rows. Use SEGMENT REJECT LIMIT to set an error threshold, specifying the reject limit count as number of ROWS (the default) or as a PERCENT of total rows (1-100).

If the number of error rows on any segment reaches the SEGMENT REJECT LIMIT, the entire external table operation is aborted and no rows are processed. The limit of error rows is per-segment, not per operation. If the limit is not reached, the operation processes all good rows and discards the erroneous rows, or logs them to an error table if you specified one.
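
For example, the reject limit can also be expressed as a percentage of the rows processed on each segment rather than as a row count:

=# CREATE EXTERNAL TABLE ext_expenses ( name text, 
   date date, amount float4, category text, desc1 text ) 
   LOCATION ('gpfdist://etlhost-1:8081/*')
   FORMAT 'TEXT' (DELIMITER '|')
   SEGMENT REJECT LIMIT 10 PERCENT;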

The LOG ERRORS INTO clause allows you to keep error rows for further examination. Use LOG ERRORS INTO to declare an error table in which to write error rows.

When you set SEGMENT REJECT LIMIT, Greenplum scans the external data in single row error isolation mode. Single row error isolation mode applies to external data rows with format errors such as extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences. Greenplum does not check constraint errors, but you can filter constraint errors by limiting the SELECT from an external table at runtime. For example, to eliminate duplicate key errors:

=# INSERT INTO table_with_pkeys 
    SELECT DISTINCT * FROM external_table;
Note: When loading data with the COPY command or an external table, the server configuration parameter gp_initial_bad_row_limit limits how many of the initial rows can contain formatting errors before processing stops. The default is to stop processing if the first 1000 rows contain formatting errors. See the Greenplum Database Reference Guide for information about the parameter.
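
For example, you might raise this limit cluster-wide with gpconfig and reload the configuration; the value shown is illustrative:

$ gpconfig -c gp_initial_bad_row_limit -v 5000
$ gpstop -u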

Define an External Table with Single Row Error Isolation

The following example logs errors internally in Greenplum Database and sets an error threshold of 10 errors.

=# CREATE EXTERNAL TABLE ext_expenses ( name text, 
   date date,  amount float4, category text, desc1 text ) 
   LOCATION ('gpfdist://etlhost-1:8081/*', 
             'gpfdist://etlhost-2:8082/*')
   FORMAT 'TEXT' (DELIMITER '|')
   LOG ERRORS SEGMENT REJECT LIMIT 10 
     ROWS;

Use the built-in SQL function gp_read_error_log('external_table') to read the error log data. This example command displays the log errors for ext_expenses:

SELECT gp_read_error_log('ext_expenses');

For information about the format of the error log, see Viewing Bad Rows in the Error Table or Error Log.

The built-in SQL function gp_truncate_error_log('external_table') deletes the error data. This example deletes the error log data created by the previous external table example:

SELECT gp_truncate_error_log('ext_expenses'); 

The following example creates an external table, ext_expenses, sets an error threshold of 10 errors, and writes error rows to the table err_expenses.

=# CREATE EXTERNAL TABLE ext_expenses ( name text, 
   date date,  amount float4, category text, desc1 text ) 
   LOCATION ('gpfdist://etlhost-1:8081/*', 
             'gpfdist://etlhost-2:8082/*')
   FORMAT 'TEXT' (DELIMITER '|')
   LOG ERRORS INTO err_expenses SEGMENT REJECT LIMIT 10 
     ROWS;

Create an Error Table and Declare a Reject Limit

The following SQL fragment creates an error table, err_expenses, and declares a reject limit of 10 rows.

LOG ERRORS INTO err_expenses SEGMENT REJECT LIMIT 10 ROWS

Viewing Bad Rows in the Error Table or Error Log

If you use single row error isolation (see Define an External Table with Single Row Error Isolation or Running COPY in Single Row Error Isolation Mode), rows with formatting errors are logged either to an error table or internally to the error log. The error table or error log has the following columns:

Table 4. Error Table Format

column      type          description
cmdtime     timestamptz   Timestamp when the error occurred.
relname     text          The name of the external table or the target table of a COPY command.
filename    text          The name of the load file that contains the error.
linenum     int           If COPY was used, the line number in the load file where the error
                          occurred. For external tables using the file:// or gpfdist:// protocol
                          and CSV format, the file name and line number are logged.
bytenum     int           For external tables with the gpfdist:// protocol and data in TEXT
                          format: the byte offset in the load file where the error occurred.
                          gpfdist parses TEXT files in blocks, so logging a line number is not
                          possible. CSV files are parsed a line at a time, so line number
                          tracking is possible for CSV files.
errmsg      text          The error message text.
rawdata     text          The raw data of the rejected row.
rawbytes    bytea         When there is a database encoding error (the client encoding used
                          cannot be converted to a server-side encoding), it is not possible to
                          log the encoding error as rawdata. Instead the raw bytes are stored
                          and you will see the octal code for any non-7-bit ASCII characters.

You can use SQL commands to query the error table and view the rows that did not load. For example:

=# SELECT * from err_expenses;

You can use the SQL function gp_read_error_log() to display formatting errors that were logged internally in Greenplum Database. For example, this command displays the error log information for the table ext_expenses:

SELECT gp_read_error_log('ext_expenses');

Identifying Invalid CSV Files in Error Table Data

If a CSV file contains invalid formatting, the rawdata field in the error table can contain several combined rows. For example, if a closing quote for a specific field is missing, all the following newlines are treated as embedded newlines. When this happens, Greenplum stops parsing a row when it reaches 64K, puts that 64K of data into the error table as a single row, resets the quote flag, and continues. If this happens three times during load processing, the load file is considered invalid and the entire load fails with the message "rejected N or more rows". See Escaping in CSV Formatted Files for more information on the correct use of quotes in CSV files.

Moving Data between Tables

You can use CREATE TABLE AS or INSERT...SELECT to load external and web external table data into another (non-external) database table, and the data will be loaded in parallel according to the external or web external table definition.

If an external table file or web external table data source has an error, one of the following will happen, depending on the isolation mode used:

  • Tables without error isolation mode: any operation that reads from that table fails. Loading from external and web external tables without error isolation mode is an all or nothing operation.
  • Tables with error isolation mode: the entire file is loaded, except for the problematic rows (subject to the configured SEGMENT REJECT LIMIT).

Loading Data

The following methods load data from readable external tables.

  • Use the gpload utility
  • Use the gphdfs protocol
  • Load with the COPY command

Loading Data with gpload

The Greenplum gpload utility loads data using readable external tables and the Greenplum parallel file server (gpfdist or gpfdists). It handles parallel file-based external table setup and allows users to configure their data format, external table definition, and gpfdist or gpfdists setup in a single configuration file.

To use gpload
  1. Ensure that your environment is set up to run gpload. Some dependent files from your Greenplum Database installation are required, such as gpfdist and Python, as well as network access to the Greenplum segment hosts. See the Greenplum Database Reference Guide for details.
  2. Create your load control file. This is a YAML-formatted file that specifies the Greenplum Database connection information, gpfdist configuration information, external table options, and data format. See the Greenplum Database Reference Guide for details.

    For example:

    ---
    VERSION: 1.0.0.1
    DATABASE: ops
    USER: gpadmin
    HOST: mdw-1
    PORT: 5432
    GPLOAD:
       INPUT:
        - SOURCE:
             LOCAL_HOSTNAME:
               - etl1-1
               - etl1-2
               - etl1-3
               - etl1-4
             PORT: 8081
             FILE: 
               - /var/load/data/*
        - COLUMNS:
               - name: text
               - amount: float4
               - category: text
               - desc: text
               - date: date
        - FORMAT: text
        - DELIMITER: '|'
        - ERROR_LIMIT: 25
        - ERROR_TABLE: payables.err_expenses
       OUTPUT:
        - TABLE: payables.expenses
        - MODE: INSERT
    SQL:
       - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
       - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
    
  3. Run gpload, passing in the load control file. For example:
    gpload -f my_load.yml
    

Loading Data with the gphdfs Protocol

If you use INSERT INTO to insert data into a Greenplum table from a table on the Hadoop Distributed File System that was defined as an external table with the gphdfs protocol, the data is copied in parallel. For example:

INSERT INTO gpdb_table (select * from hdfs_ext_table);

Loading Data with COPY

COPY FROM copies data from a file or standard input into a table and appends the data to the table contents. COPY is non-parallel: data is loaded in a single process using the Greenplum master instance.

To optimize performance, run multiple COPY commands concurrently in separate sessions and divide the data evenly across all concurrent processes. For best throughput, run one concurrent COPY operation per CPU.
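
For example, a simple way to run two concurrent COPY sessions from the master host is to split the source data into separate files and launch one psql session per file; the database, table, and file names below are illustrative:

$ psql -d ops -c "COPY expenses FROM '/data/load/expenses_part1.dat' WITH DELIMITER '|';" &
$ psql -d ops -c "COPY expenses FROM '/data/load/expenses_part2.dat' WITH DELIMITER '|';" &
$ wait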

The COPY source file must be accessible to the master host. Specify the COPY source file name relative to the master host location.

Greenplum copies data from STDIN or to STDOUT using the connection between the client and the master server.

Running COPY in Single Row Error Isolation Mode

By default, COPY stops an operation at the first error: if the data contains an error, the operation fails and no data loads. If you run COPY FROM in single row error isolation mode, Greenplum skips rows that contain format errors and loads properly formatted rows. Single row error isolation mode applies only to rows in the input file that contain format errors. If the data contains a constraint error, such as a violation of a NOT NULL, CHECK, or UNIQUE constraint, the operation fails and no data loads.

Specifying SEGMENT REJECT LIMIT runs the COPY operation in single row error isolation mode. Specify the acceptable number of error rows on each segment, after which the entire COPY FROM operation fails and no rows load. The error row count is for each Greenplum segment, not for the entire load operation.

If the COPY operation does not reach the error limit, Greenplum loads all correctly-formatted rows and discards the error rows. The LOG ERRORS INTO clause allows you to keep error rows for further examination. Use LOG ERRORS INTO to declare an error table in which to write error rows. For example:

=> COPY country FROM '/data/gpdb/country_data' 
   WITH DELIMITER '|' LOG ERRORS INTO err_country 
   SEGMENT REJECT LIMIT 10 ROWS;

See Viewing Bad Rows in the Error Table or Error Log for information about investigating error rows.

Optimizing Data Load and Query Performance

Use the following tips to help optimize your data load and subsequent query performance. A short example illustrating these tips follows the list.

  • Drop indexes before loading data into existing tables.

    Creating an index on pre-existing data is faster than updating it incrementally as each row is loaded. You can temporarily increase the maintenance_work_mem server configuration parameter to help speed up CREATE INDEX commands, though load performance is affected. Drop and recreate indexes only when there are no active users on the system.

  • Create indexes last when loading data into new tables. Create the table, load the data, and create any required indexes.
  • Run ANALYZE after loading data. If you significantly altered the data in a table, run ANALYZE or VACUUM ANALYZE to update table statistics for the query planner. Current statistics ensure that the planner makes the best decisions during query planning and avoids poor performance due to inaccurate or nonexistent statistics.
  • Run VACUUM after load errors. If the load operation does not run in single row error isolation mode, the operation stops at the first error. The target table contains the rows loaded before the error occurred. You cannot access these rows, but they occupy disk space. Use the VACUUM command to recover the wasted space.
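
The following sketch ties these tips together for a load into an existing table; the table, file, and index names are illustrative:

DROP INDEX idx_expenses_date;

COPY expenses FROM '/data/gpdb/expenses_data' WITH DELIMITER '|';

CREATE INDEX idx_expenses_date ON expenses (date);

ANALYZE expenses;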

Unloading Data from Greenplum Database

A writable external table allows you to select rows from other database tables and output the rows to files, named pipes, to applications, or as output targets for Greenplum parallel MapReduce calculations. You can define file-based and web-based writable external tables.

This topic describes how to unload data from Greenplum Database using parallel unload (writable external tables) and non-parallel unload (COPY).

Defining a File-Based Writable External Table

Writable external tables that output data to files use the Greenplum parallel file server program, gpfdist, or the Hadoop Distributed File System interface, gphdfs.

Use the CREATE WRITABLE EXTERNAL TABLE command to define the external table and specify the location and format of the output files. See Using the Greenplum Parallel File Server (gpfdist) for instructions on setting up gpfdist for use with an external table and Using Hadoop Distributed File System (HDFS) Tables for instructions on setting up gphdfs for use with an external table.

  • With a writable external table using the gpfdist protocol, the Greenplum segments send their data to gpfdist, which writes the data to the named file. gpfdist must run on a host that the Greenplum segments can access over the network. gpfdist points to a file location on the output host and writes data received from the Greenplum segments to the file. To divide the output data among multiple files, list multiple gpfdist URIs in your writable external table definition.
  • A writable external web table sends data to an application as a stream of data. For example, unload data from Greenplum Database and send it to an application that connects to another database or ETL tool to load the data elsewhere. Writable external web tables use the EXECUTE clause to specify a shell command, script, or application to run on the segment hosts and accept an input stream of data. See Defining a Command-Based Writable External Web Table for more information about using EXECUTE commands in a writable external table definition.

You can optionally declare a distribution policy for your writable external tables. By default, writable external tables use a random distribution policy. If the source table you are exporting data from has a hash distribution policy, defining the same distribution key column(s) for the writable external table improves unload performance by eliminating the requirement to move rows over the interconnect. If you unload data from a particular table, you can use the LIKE clause to copy the column definitions and distribution policy from the source table.

Example 1—Greenplum file server (gpfdist)

=# CREATE WRITABLE EXTERNAL TABLE unload_expenses 
   ( LIKE expenses ) 
   LOCATION ('gpfdist://etlhost-1:8081/expenses1.out', 
             'gpfdist://etlhost-2:8081/expenses2.out')
 FORMAT 'TEXT' (DELIMITER ',')
 DISTRIBUTED BY (exp_id);

Example 2—Hadoop file server (gphdfs)

=# CREATE WRITABLE EXTERNAL TABLE unload_expenses 
   ( LIKE expenses ) 
   LOCATION ('gphdfs://hdfslhost-1:8081/path') 
 FORMAT 'TEXT' (DELIMITER ',')
 DISTRIBUTED BY (exp_id);

You can only specify a directory for a writable external table with the gphdfs protocol. (You can specify only one file for a readable external table with the gphdfs protocol.)

Note: The default port number is 9000.

Defining a Command-Based Writable External Web Table

You can define writable external web tables to send output rows to an application or script. The application must accept an input stream, reside in the same location on all of the Greenplum segment hosts, and be executable by the gpadmin user. All segments in the Greenplum system run the application or script, whether or not a segment has output rows to process.

Use CREATE WRITABLE EXTERNAL WEB TABLE to define the external table and specify the application or script to run on the segment hosts. Commands execute from within the database and cannot access environment variables (such as $PATH). Set environment variables in the EXECUTE clause of your writable external table definition. For example:

=# CREATE WRITABLE EXTERNAL WEB TABLE output (output text) 
    EXECUTE 'export PATH=$PATH:/home/gpadmin/programs;
    myprogram.sh' 
    FORMAT 'TEXT'
    DISTRIBUTED RANDOMLY;

The following Greenplum Database variables are available for use in OS commands executed by a web or writable external table. These variables are set as environment variables in the shell that executes the command(s). They can be used to identify a set of requests made by an external table statement across the Greenplum Database array of hosts and segment instances. An example command that uses one of these variables follows the table.

Table 5. External Table EXECUTE Variables
Variable Description
$GP_CID Command count of the transaction executing the external table statement.
$GP_DATABASE The database in which the external table definition resides.
$GP_DATE The date on which the external table command ran.
$GP_MASTER_HOST The host name of the Greenplum master host from which the external table statement was dispatched.
$GP_MASTER_PORT The port number of the Greenplum master instance from which the external table statement was dispatched.
$GP_SEG_DATADIR The location of the data directory of the segment instance executing the external table command.
$GP_SEG_PG_CONF The location of the postgresql.conf file of the segment instance executing the external table command.
$GP_SEG_PORT The port number of the segment instance executing the external table command.
$GP_SEGMENT_COUNT The total number of primary segment instances in the Greenplum Database system.
$GP_SEGMENT_ID The ID number of the segment instance executing the external table command (same as dbid in gp_segment_configuration).
$GP_SESSION_ID The database session identifier number associated with the external table statement.
$GP_SN Serial number of the external table scan node in the query plan of the external table statement.
$GP_TIME The time the external table command was executed.
$GP_USER The database user executing the external table statement.
$GP_XID The transaction ID of the external table statement.
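
For example, a writable external web table might use $GP_SEGMENT_ID to write each segment's rows to a separate file. This is a minimal sketch; the output path is illustrative:

=# CREATE WRITABLE EXTERNAL WEB TABLE output_by_segment (output text)
    EXECUTE 'cat > /tmp/output_file_$GP_SEGMENT_ID.txt'
    FORMAT 'TEXT'
    DISTRIBUTED RANDOMLY;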

Disabling EXECUTE for Web or Writable External Tables

There is a security risk associated with allowing external tables to execute OS commands or scripts. To disable the use of EXECUTE in web and writable external table definitions, set the gp_external_enable_exec server configuration parameter to off in your master postgresql.conf file:

gp_external_enable_exec = off

Unloading Data Using a Writable External Table

Writable external tables allow only INSERT operations. You must grant INSERT permission on a table to enable access to users who are not the table owner or a superuser. For example:

GRANT INSERT ON writable_ext_table TO admin;

To unload data using a writable external table, select the data from the source table(s) and insert it into the writable external table. The resulting rows are output to the writable external table. For example:

INSERT INTO writable_ext_table SELECT * FROM regular_table;

Unloading Data Using COPY

COPY TO copies data from a table to a file (or standard output) on the Greenplum master host using a single process on the Greenplum master instance. Use COPY to output a table's entire contents, or filter the output using a SELECT statement. For example:

COPY (SELECT * FROM country WHERE country_name LIKE 'A%') 
TO '/home/gpadmin/a_list_countries.out';

Transforming XML Data

The Greenplum Database data loader gpfdist provides transformation features to load XML data into a table and to write data from the Greenplum Database to XML files. The following diagram shows gpfdist performing an XML transform.

Figure 4. External Tables using XML Transformations

To load or extract XML data:

  1. Determine the transformation schema.
  2. Write a transform.
  3. Write the gpfdist configuration.
  4. Load the data.
  5. Transfer and store the data.

The first three steps comprise most of the development effort. The last two steps are straightforward and repeatable, suitable for production.

Determine the Transformation Schema

To prepare for the transformation project:

  1. Determine the goal of the project, such as indexing data, analyzing data, combining data, and so on.
  2. Examine the XML file and note the file structure and element names.
  3. Choose the elements to import and decide if any other limits are appropriate.

For example, the following XML file, prices.xml, is a simple, short file that contains price records. Each price record contains two fields: an item number and a price.

<?xml version="1.0" encoding="ISO-8859-1" ?>
<prices>
  <pricerecord>
    <itemnumber>708421</itemnumber>
    <price>19.99</price>
  </pricerecord>
  <pricerecord>
    <itemnumber>708466</itemnumber>
    <price>59.25</price>
  </pricerecord>
  <pricerecord>
    <itemnumber>711121</itemnumber>
    <price>24.99</price>
  </pricerecord>
</prices>

The goal is to import all the data into a Greenplum Database table with an integer itemnumber column and a decimal price column.

Write a Transform

The transform specifies what to extract from the data. You can use any authoring environment and language appropriate for your project. For XML transformations, Pivotal suggests choosing a technology such as XSLT, Joost (STX), Java, Python, or Perl, based on the goals and scope of the project.

In the price example, the next step is to transform the XML data into a simple two-column delimited format.

708421|19.99
708466|59.25
711121|24.99

The following STX transform, called input_transform.stx, completes the data transformation.

<?xml version="1.0"?>
<stx:transform version="1.0"
   xmlns:stx="http://stx.sourceforge.net/2002/ns"
   pass-through="none">
  <!-- declare variables -->
  <stx:variable name="itemnumber"/>
  <stx:variable name="price"/>
  <!-- match and output prices as columns delimited by | -->
  <stx:template match="/prices/pricerecord">
    <stx:process-children/>
    <stx:value-of select="$itemnumber"/><stx:text>|</stx:text>
    <stx:value-of select="$price"/><stx:text>
</stx:text>
  </stx:template>
  <stx:template match="itemnumber">
    <stx:assign name="itemnumber" select="."/>
  </stx:template>
  <stx:template match="price">
    <stx:assign name="price" select="."/>
  </stx:template>
</stx:transform>

This STX transform declares two temporary variables, itemnumber and price, and the following rules.

  1. When an element that satisfies the XPath expression /prices/pricerecord is found, examine the child elements and generate output that contains the value of the itemnumber variable, a | character, the value of the price variable, and a newline.
  2. When an <itemnumber> element is found, store the content of that element in the variable itemnumber.
  3. When a <price> element is found, store the content of that element in the variable price.

Write the gpfdist Configuration

The gpfdist configuration is specified as a YAML 1.1 document. It specifies rules that gpfdist uses to select a Transform to apply when loading or extracting data.

This example gpfdist configuration contains the following items:

  • the config.yaml file defining TRANSFORMATIONS
  • the input_transform.sh wrapper script, referenced in the config.yaml file
  • the input_transform.stx joost transformation, called from input_transform.sh

Aside from the ordinary YAML rules, such as starting the document with three dashes (---), a gpfdist configuration must conform to the following restrictions:

  1. a VERSION setting must be present with the value 1.0.0.1.
  2. a TRANSFORMATIONS setting must be present and contain one or more mappings.
  3. Each mapping in the TRANSFORMATIONS section must contain:
    • a TYPE with the value 'input' or 'output'
    • a COMMAND indicating how the transform is run.
  4. Each mapping in the TRANSFORMATIONS section can contain optional CONTENT, SAFE, and STDERR settings.

The following gpfdist configuration, called config.yaml, applies to the prices example. The initial indentation on each line is significant and reflects the hierarchical nature of the specification. The name prices_input in the following example is referenced later when creating the external table in SQL.

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  prices_input:
    TYPE:     input
    COMMAND:  /bin/bash input_transform.sh %filename%

The COMMAND setting uses a wrapper script called input_transform.sh with a %filename% placeholder. When gpfdist runs the prices_input transform, it invokes input_transform.sh with /bin/bash and replaces the %filename% placeholder with the path to the input file to transform. The wrapper script called input_transform.sh contains the logic to invoke the STX transformation and return the output.

If Joost is used, the Joost STX engine must be installed.

#!/bin/bash
# input_transform.sh - sample input transformation, 
# demonstrating use of Java and Joost STX to convert XML into
# text to load into Greenplum Database.
# java arguments:
#   -jar joost.jar         joost STX engine
#   -nodecl                  don't generate a <?xml?> declaration
#   $1                        filename to process
#   input_transform.stx    the STX transformation
#
# the AWK step eliminates a blank line joost emits at the end
java \
    -jar joost.jar \
    -nodecl \
    $1 \
    input_transform.stx \
 | awk 'NF>0'

The input_transform.sh file uses the Joost STX engine with the AWK interpreter. The following diagram shows the process flow as gpfdist runs the transformation.



Load the Data

Create the tables with SQL statements based on the appropriate schema.

There are no special requirements for the Greenplum Database tables that hold loaded data. In the prices example, the following command creates the appropriate table.

CREATE TABLE prices (
  itemnumber integer,       
  price       decimal        
) 
DISTRIBUTED BY (itemnumber);

Transfer and Store the Data

Use one of the following approaches to transform the data with gpfdist.

  • GPLOAD supports only input transformations, but is easier to implement in many cases.
  • INSERT INTO SELECT FROM supports both input and output transformations, but exposes more details.

Transforming with GPLOAD

Transforming data with GPLOAD requires that the settings TRANSFORM and TRANSFORM_CONFIG appear in the INPUT section of the GPLOAD control file. For more information about the syntax and placement of these settings in the GPLOAD control file, see the Greenplum Database Reference Guide.

  • TRANSFORM_CONFIG specifies the name of the gpfdist configuration file.
  • The TRANSFORM setting indicates the name of the transformation that is described in the file named in TRANSFORM_CONFIG.
---
VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
GPLOAD:
   INPUT:
    - TRANSFORM_CONFIG: config.yaml
    - TRANSFORM: prices_input
    - SOURCE:
         FILE: prices.xml

The transformation name must appear in two places: in the TRANSFORM setting of the GPLOAD control file and in the TRANSFORMATIONS section of the gpfdist configuration file named in the TRANSFORM_CONFIG setting.

In the GPLOAD control file, the optional parameter MAX_LINE_LENGTH specifies the maximum length of a line in the XML transformation data that is passed to gpload.
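
For example, the INPUT section of the control file might set the limit to roughly 1 MB; the value shown here is illustrative:

   INPUT:
    - MAX_LINE_LENGTH: 1048576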

The following diagram shows the relationships between the GPLOAD control file, the gpfdist configuration file, and the XML data file.



Transforming with INSERT INTO SELECT FROM

Specify the transformation in the CREATE EXTERNAL TABLE definition's LOCATION clause. For example, the following command specifies the prices_input transform in the LOCATION URI. (Run gpfdist first, using the command gpfdist -c config.yaml.)

CREATE READABLE EXTERNAL TABLE prices_readable (LIKE prices)
   LOCATION ('gpfdist://hostname:8080/prices.xml#transform=prices_input')
   FORMAT 'TEXT' (DELIMITER '|')
   LOG ERRORS INTO prices_errortable SEGMENT REJECT LIMIT 10;

In the command above, change hostname to your hostname. prices_input comes from the configuration file.

The following query loads data into the prices table.

INSERT INTO prices SELECT * FROM prices_readable;

Configuration File Format

The gpfdist configuration file uses the YAML 1.1 document format and implements a schema for defining the transformation parameters. The configuration file must be a valid YAML document.

The gpfdist program processes the document in order and uses indentation (spaces) to determine the document hierarchy and relationships of the sections to one another. The use of white space is significant. Do not use white space for formatting and do not use tabs.

The following is the basic structure of a configuration file.

---
VERSION:   1.0.0.1
TRANSFORMATIONS: 
  transformation_name1:
    TYPE:      input | output
    COMMAND:   command
    CONTENT:   data | paths
    SAFE:      posix-regex
    STDERR:    server | console
  transformation_name2:
    TYPE:      input | output
    COMMAND:   command 
...
VERSION
Required. The version of the gpfdist configuration file schema. The current version is 1.0.0.1.
TRANSFORMATIONS
Required. Begins the transformation specification section. A configuration file must have at least one transformation. When gpfdist receives a transformation request, it looks in this section for an entry with the matching transformation name.
TYPE
Required. Specifies the direction of transformation. Values are input or output.
  • input: gpfdist treats the standard output of the transformation process as a stream of records to load into Greenplum Database.
  • output: gpfdist treats the standard input of the transformation process as a stream of records from Greenplum Database to transform and write to the appropriate output.
COMMAND

Required. Specifies the command gpfdist will execute to perform the transformation.

For input transformations, gpfdist invokes the command specified in the CONTENT setting. The command is expected to open the underlying file(s) as appropriate and produce one line of TEXT for each row to load into Greenplum Database. The input transform determines whether the entire content should be converted to one row or to multiple rows.

For output transformations, gpfdist invokes this command as specified in the CONTENT setting. The output command is expected to open and write to the underlying file(s) as appropriate. The output transformation determines the final placement of the converted output.

CONTENT

Optional. The values are data and paths. The default value is data.

  • When CONTENT specifies data, the text %filename% in the COMMAND section is replaced by the path to the file to read or write.
  • When CONTENT specifies paths, the text %filename% in the COMMAND section is replaced by the path to the temporary file that contains the list of files to read or write.

The following is an example of a COMMAND section showing the text %filename% that is replaced.

COMMAND: /bin/bash input_transform.sh %filename%
SAFE

Optional. A POSIX regular expression that the paths must match to be passed to the transformation. Specify SAFE when there is a concern about injection or improper interpretation of paths passed to the command. The default is no restriction on paths.
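
For example, a transformation entry might restrict paths to a dedicated load directory; the pattern below is illustrative:

  prices_input:
    TYPE:     input
    COMMAND:  /bin/bash input_transform.sh %filename%
    SAFE:     ^/var/load/.*\.xml$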

STDERR

Optional. The values are server and console.

This setting specifies how to handle standard error output from the transformation. The default, server, specifies that gpfdist captures the standard error output from the transformation in a temporary file and sends the first 8k of that file to Greenplum Database as an error message. The error message appears as a SQL error. console specifies that gpfdist does not redirect or transmit the standard error output from the transformation.

XML Transformation Examples

The following examples demonstrate the complete process for different types of XML data and STX transformations. Files and detailed instructions associated with these examples are in demo/gpfdist_transform.tar.gz. Read the README file in the Before You Begin section before you run the examples. The README file explains how to download the example data file used in the examples.

Example 1 - DBLP Database Publications (In demo Directory)

This example demonstrates loading and extracting database research data. The data is in the form of a complex XML file downloaded from the University of Washington. The DBLP information consists of a top level <dblp> element with multiple child elements such as <article>, <proceedings>, <mastersthesis>, <phdthesis>, and so on, where the child elements contain details about the associated publication. For example, the following is the XML for one publication.

<?xml version="1.0" encoding="UTF-8"?> 
<mastersthesis key="ms/Brown92">
<author>Kurt P. Brown</author>
<title>PRPL: A Database Workload Language, v1.3.</title>
<year>1992</year>
<school>Univ. of Wisconsin-Madison</school>
</mastersthesis> 

The goal is to import these <mastersthesis> and <phdthesis> records into Greenplum Database. The sample document, dblp.xml, is about 130MB in size uncompressed. The input contains no tabs, so the relevant information can be converted into tab-delimited records as follows:

ms/Brown92 tab masters tab Kurt P. Brown tab PRPL: A Database 
Workload Specification Language, v1.3. tab 1992 tab Univ. of 
Wisconsin-Madison newline

With the columns:

  • key text, -- e.g. ms/Brown92
  • type text, -- e.g. masters
  • author text, -- e.g. Kurt P. Brown
  • title text, -- e.g. PRPL: A Database Workload Language, v1.3.
  • year text, -- e.g. 1992
  • school text, -- e.g. Univ. of Wisconsin-Madison

Then, load the data into Greenplum Database. After the data loads, verify the data by extracting the loaded records as XML with an output transformation.
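
A target table matching these columns might be created as follows; this is a sketch, and the table name is illustrative:

CREATE TABLE dblp_theses (
  key     text,
  type    text,
  author  text,
  title   text,
  year    text,
  school  text
)
DISTRIBUTED BY (key);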

Example 2 - IRS MeF XML Files (In demo Directory)

This example demonstrates loading a sample IRS Modernized eFile tax return using a Joost STX transformation. The data is in the form of a complex XML file.

The U.S. Internal Revenue Service (IRS) made a significant commitment to XML and specifies its use in its Modernized e-File (MeF) system. In MeF, each tax return is an XML document with a deep hierarchical structure that closely reflects the particular form of the underlying tax code.

XML, XML Schema and stylesheets play a role in their data representation and business workflow. The actual XML data is extracted from a ZIP file attached to a MIME "transmission file" message. For more information about MeF, see Modernized e-File (Overview) on the IRS web site.

The sample XML document, RET990EZ_2006.xml, is about 350KB in size with two elements:

  • ReturnHeader
  • ReturnData

The <ReturnHeader> element contains general details about the tax return such as the taxpayer's name, the tax year of the return, and the preparer. The <ReturnData> element contains multiple sections with specific details about the tax return and associated schedules.

The following is an abridged sample of the XML file.

<?xml version="1.0" encoding="UTF-8"?> 
<Return returnVersion="2006v2.0"
   xmlns="http://www.irs.gov/efile" 
   xmlns:efile="http://www.irs.gov/efile"
   xsi:schemaLocation="http://www.irs.gov/efile"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> 
   <ReturnHeader binaryAttachmentCount="1">
     <ReturnId>AAAAAAAAAAAAAAAAAAAA</ReturnId>
     <Timestamp>1999-05-30T12:01:01+05:01</Timestamp>
     <ReturnType>990EZ</ReturnType>
     <TaxPeriodBeginDate>2005-01-01</TaxPeriodBeginDate>
     <TaxPeriodEndDate>2005-12-31</TaxPeriodEndDate>
     <Filer>
       <EIN>011248772</EIN>
       ... more data ...
     </Filer>
     <Preparer>
       <Name>Percy Polar</Name>
       ... more data ...
     </Preparer>
     <TaxYear>2005</TaxYear>
   </ReturnHeader>
   ... more data ..

The goal is to import all the data into a Greenplum database. First, convert the XML document into text with newlines "escaped", with two columns: ReturnId and a single column on the end for the entire MeF tax return. For example:

AAAAAAAAAAAAAAAAAAAA|<Return returnVersion="2006v2.0"... 

Load the data into Greenplum Database.

Example 3 - WITSML™ Files (In demo Directory)

This example demonstrates loading sample data describing an oil rig using a Joost STX transformation. The data is in the form of a complex XML file downloaded from energistics.org.

The Wellsite Information Transfer Standard Markup Language (WITSML™) is an oil industry initiative to provide open, non-proprietary, standard interfaces for technology and software to share information among oil companies, service companies, drilling contractors, application vendors, and regulatory agencies. For more information about WITSML™, see http://www.witsml.org.

The oil rig information consists of a top level <rigs> element with multiple child elements such as <documentInfo>, <rig>, and so on. The following excerpt from the file shows the type of information in the <rig> tag.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="../stylesheets/rig.xsl" type="text/xsl" media="screen"?>
<rigs 
 xmlns="http://www.witsml.org/schemas/131" 
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
 xsi:schemaLocation="http://www.witsml.org/schemas/131 ../obj_rig.xsd" 
 version="1.3.1.1">
 <documentInfo>
 ... misc data ...
 </documentInfo>
 <rig uidWell="W-12" uidWellbore="B-01" uid="xr31">
     <nameWell>6507/7-A-42</nameWell>
     <nameWellbore>A-42</nameWellbore>
     <name>Deep Drill #5</name>
     <owner>Deep Drilling Co.</owner>
     <typeRig>floater</typeRig>
     <manufacturer>Fitsui Engineering</manufacturer>
     <yearEntService>1980</yearEntService>
     <classRig>ABS Class A1 M CSDU AMS ACCU</classRig>
     <approvals>DNV</approvals>
 ... more data ...

The goal is to import the information for this rig into Greenplum Database.

The sample document, rig.xml, is about 11KB in size. The input does not contain tabs so the relevant information can be converted into records delimited with a pipe (|).

W-12|6507/7-A-42|xr31|Deep Drill #5|Deep Drilling Co.|John Doe|John.Doe@example.com|

With the columns:

  • well_uid text, -- e.g. W-12
  • well_name text, -- e.g. 6507/7-A-42
  • rig_uid text, -- e.g. xr31
  • rig_name text, -- e.g. Deep Drill #5
  • rig_owner text, -- e.g. Deep Drilling Co.
  • rig_contact text, -- e.g. John Doe
  • rig_email text, -- e.g. John.Doe@example.com
  • doc xml

Then, load the data into Greenplum Database.

Formatting Data Files

When you use the Greenplum tools for loading and unloading data, you must specify how your data is formatted. COPY, CREATE EXTERNAL TABLE, and gpload have clauses that allow you to specify how your data is formatted. Data can be delimited text (TEXT) or comma separated values (CSV) format. External data must be formatted correctly to be read by Greenplum Database. This topic explains the format of data files expected by Greenplum Database.

Formatting Rows

Greenplum Database expects rows of data to be separated by the LF character (Line feed, 0x0A), CR (Carriage return, 0x0D), or CR followed by LF (CR+LF, 0x0D 0x0A). LF is the standard newline representation on UNIX or UNIX-like operating systems. Operating systems such as Windows or Mac OS 9 use CR or CR+LF. All of these representations of a newline are supported by Greenplum Database as a row delimiter. For more information, see Importing and Exporting Fixed Width Data.

Formatting Columns

The default column or field delimiter is the horizontal TAB character (0x09) for text files and the comma character (0x2C) for CSV files. You can declare a single character delimiter using the DELIMITER clause of COPY, CREATE EXTERNAL TABLE or gpload when you define your data format. The delimiter character must appear between any two data value fields. Do not place a delimiter at the beginning or end of a row. For example, if the pipe character ( | ) is your delimiter:

data value 1|data value 2|data value 3

The following command shows the use of the pipe character as a column delimiter:

=# CREATE EXTERNAL TABLE ext_table (name text, date date)
LOCATION ('gpfdist://<hostname>/filename.txt')
FORMAT 'TEXT' (DELIMITER '|');

Representing NULL Values

NULL represents an unknown piece of data in a column or field. Within your data files you can designate a string to represent null values. The default string is \N (backslash-N) in TEXT mode, or an empty value with no quotations in CSV mode. You can also declare a different string using the NULL clause of COPY, CREATE EXTERNAL TABLE or gpload when defining your data format. For example, you can use an empty string if you do not want to distinguish nulls from empty strings. When using the Greenplum Database loading tools, any data item that matches the designated null string is considered a null value.
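
For example, the following FORMAT clause of a CREATE EXTERNAL TABLE command treats an empty string as NULL in a TEXT-formatted file (a sketch):

FORMAT 'TEXT' (DELIMITER '|' NULL '')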

Escaping

There are two reserved characters that have special meaning to Greenplum Database:

  • The designated delimiter character separates columns or fields in the data file.
  • The newline character designates a new row in the data file.

If your data contains either of these characters, you must escape the character so that Greenplum treats it as data and not as a field separator or new row. By default, the escape character is a \ (backslash) for text-formatted files and a double quote (") for csv-formatted files.

Escaping in Text Formatted Files

By default, the escape character is a \ (backslash) for text-formatted files. You can declare a different escape character in the ESCAPE clause of COPY, CREATE EXTERNAL TABLE or gpload. If your escape character appears in your data, use it to escape itself.

For example, suppose you have a table with three columns and you want to load the following three fields:

  • backslash = \
  • vertical bar = |
  • exclamation point = !

Your designated delimiter character is | (pipe character), and your designated escape character is \ (backslash). The formatted row in your data file looks like this:

backslash = \\ | vertical bar = \| | exclamation point = !

Notice how the backslash character that is part of the data is escaped with another backslash character, and the pipe character that is part of the data is escaped with a backslash character.

You can use the escape character to escape octal and hexadecimal sequences. The escaped value is converted to the equivalent character when loaded into Greenplum Database. For example, to load the ampersand character (&), use the escape character to escape its equivalent hexadecimal (\0x26) or octal (\046) representation.

You can disable escaping in TEXT-formatted files using the ESCAPE clause of COPY, CREATE EXTERNAL TABLE or gpload as follows:

ESCAPE 'OFF'

This is useful for input data that contains many backslash characters, such as web log data.

Escaping in CSV Formatted Files

By default, the escape character is a " (double quote) for CSV-formatted files. If you want to use a different escape character, use the ESCAPE clause of COPY, CREATE EXTERNAL TABLE or gpload to declare a different escape character. In cases where your selected escape character is present in your data, you can use it to escape itself.

For example, suppose you have a table with three columns and you want to load the following three fields:

  • Free trip to A,B
  • 5.89
  • Special rate "1.79"

Your designated delimiter character is , (comma), and your designated escape character is " (double quote). The formatted row in your data file looks like this:

"Free trip to A,B","5.89","Special rate ""1.79"""

The data value with a comma character that is part of the data is enclosed in double quotes. The double quotes that are part of the data are escaped with a double quote even though the field value is enclosed in double quotes.

Embedding the entire field inside a set of double quotes guarantees preservation of leading and trailing whitespace characters:

"Free trip to A,B ","5.89 ","Special rate ""1.79"" "

Note: In CSV mode, all characters are significant. A quoted value surrounded by white space, or any characters other than DELIMITER, includes those characters. This can cause errors if you import data from a system that pads CSV lines with white space to some fixed width. In this case, preprocess the CSV file to remove the trailing white space before importing the data into Greenplum Database.

Character Encoding

Character encoding systems consist of a code that pairs each character from a character set with something else, such as a sequence of numbers or octets, to facilitate data transmission and storage. Greenplum Database supports a variety of character sets, including single-byte character sets such as the ISO 8859 series and multiple-byte character sets such as EUC (Extended UNIX Code), UTF-8, and Mule internal code. Clients can use all supported character sets transparently, but a few are not supported for use within the server as a server-side encoding.

Data files must be in a character encoding recognized by Greenplum Database. See the Greenplum Database Reference Guide for the supported character sets. Data files that contain invalid or unsupported encoding sequences encounter errors when loading into Greenplum Database.

Note: On data files generated on a Microsoft Windows operating system, run the dos2unix system command to remove any Windows-only characters before loading into Greenplum Database.

Example Custom Data Access Protocol

The following is the API for the Greenplum Database custom data access protocol. The example protocol implementation gpextprotocol.c is written in C and shows how the API can be used. For information about accessing a custom data access protocol, see Using a Custom Protocol.

/* ---- Read/Write function API ------*/
CALLED_AS_EXTPROTOCOL(fcinfo)
EXTPROTOCOL_GET_URL(fcinfo) 
EXTPROTOCOL_GET_DATABUF(fcinfo) 
EXTPROTOCOL_GET_DATALEN(fcinfo) 
EXTPROTOCOL_GET_SCANQUALS(fcinfo) 
EXTPROTOCOL_GET_USER_CTX(fcinfo) 
EXTPROTOCOL_IS_LAST_CALL(fcinfo) 
EXTPROTOCOL_SET_LAST_CALL(fcinfo) 
EXTPROTOCOL_SET_USER_CTX(fcinfo, p)

/* ------ Validator function API ------*/
CALLED_AS_EXTPROTOCOL_VALIDATOR(fcinfo)
EXTPROTOCOL_VALIDATOR_GET_URL_LIST(fcinfo) 
EXTPROTOCOL_VALIDATOR_GET_NUM_URLS(fcinfo) 
EXTPROTOCOL_VALIDATOR_GET_NTH_URL(fcinfo, n) 
EXTPROTOCOL_VALIDATOR_GET_DIRECTION(fcinfo)

Notes

The protocol corresponds to the example described in Using a Custom Protocol. The source code file name and shared object are gpextprotocol.c and gpextprotocol.so.

The protocol has the following properties:

  • The name defined for the protocol is myprot.
  • The protocol has the following simple form: the protocol name and a path, separated by ://.

    myprot://path

  • Three functions are implemented:
    • myprot_import() a read function
    • myprot_export() a write function
    • myprot_validate_urls() a validation function

    These functions are referenced in the CREATE PROTOCOL statement when the protocol is created and declared in the database.

The example implementation gpextprotocol.c uses fopen() and fread() to simulate a simple protocol that reads local files. In practice, however, the protocol would implement functionality such as a remote connection to some process over the network.
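
As a sketch of how the compiled functions might be registered and bound to the protocol (the authoritative steps and statements for this example are in Using a Custom Protocol; the function names below follow the Notes above):

CREATE FUNCTION myprot_import() RETURNS integer
  AS '$libdir/gpextprotocol.so', 'myprot_import'
LANGUAGE C STABLE;

CREATE FUNCTION myprot_export() RETURNS integer
  AS '$libdir/gpextprotocol.so', 'myprot_export'
LANGUAGE C STABLE;

CREATE FUNCTION myprot_validate_urls() RETURNS void
  AS '$libdir/gpextprotocol.so', 'myprot_validate_urls'
LANGUAGE C STABLE;

CREATE PROTOCOL myprot (
  readfunc      = 'myprot_import',
  writefunc     = 'myprot_export',
  validatorfunc = 'myprot_validate_urls'
);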

Installing the External Table Protocol

To use the example external table protocol, you use the C compiler cc to compile and link the source code to create a shared object that can be dynamically loaded by Greenplum Database. The commands to compile and link the source code on a Linux system are similar to this:

cc -fpic -c gpextprotocol.c
cc -shared -o gpextprotocol.so gpextprotocol.o

The option -fpic specifies creating position-independent code (PIC) and the -c option compiles the source code without linking and creates an object file. The object file needs to be created as position-independent code (PIC) so that it can be loaded at any arbitrary location in memory by Greenplum Database.

The flag -shared specifies creating a shared object (shared library) and the -o option specifies the shared object file name gpextprotocol.so. Refer to the GCC manual for more information on the cc options.

The header files that are declared as include files in gpextprotocol.c are located in subdirectories of $GPHOME/include/postgresql/.

For more information on compiling and linking dynamically-loaded functions and examples of compiling C source code to create a shared library on other operating systems, see the Postgres documentation at http://www.postgresql.org/docs/8.4/static/xfunc-c.html#DFUNC.

The manual pages for the C compiler cc and the link editor ld for your operating system also contain information on compiling and linking source code on your system.

The compiled code (shared object file) for the custom protocol must be placed in the same location on every host in your Greenplum Database array (master and all segments). This location must also be in the LD_LIBRARY_PATH so that the server can locate the files. It is recommended to locate shared libraries either relative to $libdir (which is located at $GPHOME/lib) or through the dynamic library path (set by the dynamic_library_path server configuration parameter) on the master and all segment instances in the Greenplum array. You can use the Greenplum Database utilities gpssh and gpscp to update segments.
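
For example, a sketch using gpscp to copy the shared object to the same path on every host listed in a host file (the host file name and destination path are illustrative):

gpscp -f hostfile_gpssh gpextprotocol.so =:/usr/local/greenplum-db/lib/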

gpextprotocol.c

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"
#include "access/extprotocol.h"
#include "catalog/pg_proc.h"
#include "utils/array.h"
#include "utils/builtins.h"
#include "utils/memutils.h"

/* Our chosen URI format. We can change it however needed */
typedef struct DemoUri
{
       char      *protocol;
       char      *path;
}  DemoUri;

static DemoUri *ParseDemoUri(const char *uri_str);
static void FreeDemoUri(DemoUri *uri);
/* Do the module magic dance */
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(myprot_export);
PG_FUNCTION_INFO_V1(myprot_import);
PG_FUNCTION_INFO_V1(myprot_validate_urls);

Datum myprot_export(PG_FUNCTION_ARGS);
Datum myprot_import(PG_FUNCTION_ARGS);
Datum myprot_validate_urls(PG_FUNCTION_ARGS);

/* A user context that persists across calls. Can be declared in any other way */
typedef struct
{
      char     *url;
      char     *filename;
      FILE     *file;
} extprotocol_t;

/*
 * The read function - Import data into GPDB.
 */
Datum
myprot_import(PG_FUNCTION_ARGS)
{
    extprotocol_t *myData;
    char          *data;
    int            datlen;
    size_t         nread = 0;

    /* Must be called via the external table format manager */
    if (!CALLED_AS_EXTPROTOCOL(fcinfo))
        elog(ERROR, "myprot_import: not called by external protocol manager");

    /* Get our internal description of the protocol */
    myData = (extprotocol_t *) EXTPROTOCOL_GET_USER_CTX(fcinfo);

    if (EXTPROTOCOL_IS_LAST_CALL(fcinfo))
    {
        /* we're done receiving data. close our connection */
        if (myData && myData->file)
            if (fclose(myData->file))
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not close file \"%s\": %m",
                                myData->filename)));
        PG_RETURN_INT32(0);
    }

    if (myData == NULL)
    {
        /* first call. do any desired init */
        const char *p_name = "myprot";
        DemoUri    *parsed_url;
        char       *url = EXTPROTOCOL_GET_URL(fcinfo);

        myData = palloc(sizeof(extprotocol_t));
        myData->url = pstrdup(url);
        parsed_url = ParseDemoUri(myData->url);
        myData->filename = pstrdup(parsed_url->path);

        if (strcasecmp(parsed_url->protocol, p_name) != 0)
            elog(ERROR, "internal error: myprot called with a different protocol (%s)",
                 parsed_url->protocol);
        FreeDemoUri(parsed_url);

        /* open the source file (or connect to a remote server in other cases) */
        myData->file = fopen(myData->filename, "r");
        if (myData->file == NULL)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("myprot_import: could not open file \"%s\" for reading: %m",
                            myData->filename),
                     errOmitLocation(true)));

        EXTPROTOCOL_SET_USER_CTX(fcinfo, myData);
    }

    /* ==========================================
     *          DO THE IMPORT
     * ========================================== */
    data   = EXTPROTOCOL_GET_DATABUF(fcinfo);
    datlen = EXTPROTOCOL_GET_DATALEN(fcinfo);

    /* read some bytes (with fread in this example, but normally
       with some other method over the network) */
    if (datlen > 0)
    {
        nread = fread(data, 1, datlen, myData->file);
        if (ferror(myData->file))
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("myprot_import: could not read from file \"%s\": %m",
                            myData->filename)));
    }

    PG_RETURN_INT32((int) nread);
}

/*
 * Write function - Export data out of GPDB
 */
Datum
myprot_export(PG_FUNCTION_ARGS)
{
    extprotocol_t *myData;
    char          *data;
    int            datlen;
    size_t         wrote = 0;

    /* Must be called via the external table format manager */
    if (!CALLED_AS_EXTPROTOCOL(fcinfo))
        elog(ERROR, "myprot_export: not called by external protocol manager");

    /* Get our internal description of the protocol */
    myData = (extprotocol_t *) EXTPROTOCOL_GET_USER_CTX(fcinfo);

    if (EXTPROTOCOL_IS_LAST_CALL(fcinfo))
    {
        /* we're done sending data. close our connection */
        if (myData && myData->file)
            if (fclose(myData->file))
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not close file \"%s\": %m",
                                myData->filename)));
        PG_RETURN_INT32(0);
    }

    if (myData == NULL)
    {
        /* first call. do any desired init */
        const char *p_name = "myprot";
        DemoUri    *parsed_url;
        char       *url = EXTPROTOCOL_GET_URL(fcinfo);

        myData = palloc(sizeof(extprotocol_t));
        myData->url = pstrdup(url);
        parsed_url = ParseDemoUri(myData->url);
        myData->filename = pstrdup(parsed_url->path);

        if (strcasecmp(parsed_url->protocol, p_name) != 0)
            elog(ERROR, "internal error: myprot called with a different protocol (%s)",
                 parsed_url->protocol);
        FreeDemoUri(parsed_url);

        /* open the destination file (or connect to a remote server in other cases) */
        myData->file = fopen(myData->filename, "a");
        if (myData->file == NULL)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("myprot_export: could not open file \"%s\" for writing: %m",
                            myData->filename),
                     errOmitLocation(true)));

        EXTPROTOCOL_SET_USER_CTX(fcinfo, myData);
    }

    /* ========================================
     *          DO THE EXPORT
     * ======================================== */
    data   = EXTPROTOCOL_GET_DATABUF(fcinfo);
    datlen = EXTPROTOCOL_GET_DATALEN(fcinfo);

    /* write some bytes (with fwrite in this example, but normally
       with some other method over the network) */
    if (datlen > 0)
    {
        wrote = fwrite(data, 1, datlen, myData->file);
        if (ferror(myData->file))
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("myprot_export: could not write to file \"%s\": %m",
                            myData->filename)));
    }

    PG_RETURN_INT32((int) wrote);
}

Datum
myprot_validate_urls(PG_FUNCTION_ARGS)
{
    List               *urls;
    int                 nurls;
    int                 i;
    ValidatorDirection  direction;

    /* Must be called via the external table format manager */
    if (!CALLED_AS_EXTPROTOCOL_VALIDATOR(fcinfo))
        elog(ERROR, "myprot_validate_urls: not called by external protocol manager");

    nurls     = EXTPROTOCOL_VALIDATOR_GET_NUM_URLS(fcinfo);
    urls      = EXTPROTOCOL_VALIDATOR_GET_URL_LIST(fcinfo);
    direction = EXTPROTOCOL_VALIDATOR_GET_DIRECTION(fcinfo);

    /*
     * Dumb example 1: search each url for a substring
     * we don't want to be used in a url. in this example
     * it's 'secured_directory'.
     */
    for (i = 1; i <= nurls; i++)
    {
        char *url = EXTPROTOCOL_VALIDATOR_GET_NTH_URL(fcinfo, i);

        if (strstr(url, "secured_directory") != 0)
        {
            ereport(ERROR,
                    (errcode(ERRCODE_PROTOCOL_VIOLATION),
                     errmsg("using 'secured_directory' in a url isn't allowed")));
        }
    }

    /*
     * Dumb example 2: set a limit on the number of urls
     * used. In this example we limit readable external
     * tables that use our protocol to 2 urls max.
     */
    if (direction == EXT_VALIDATE_READ && nurls > 2)
    {
        ereport(ERROR,
                (errcode(ERRCODE_PROTOCOL_VIOLATION),
                 errmsg("more than 2 urls aren't allowed in this protocol")));
    }

    PG_RETURN_VOID();
}

/* --- utility functions --- */

static
DemoUri *ParseDemoUri(const char *uri_str)
{
    DemoUri *uri = (DemoUri *) palloc0(sizeof(DemoUri));
    int      protocol_len;

    uri->path = NULL;
    uri->protocol = NULL;

    /*
     * parse protocol
     */
    char *post_protocol = strstr(uri_str, "://");

    if (!post_protocol)
    {
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("invalid protocol URI \'%s\'", uri_str),
                 errOmitLocation(true)));
    }

    protocol_len = post_protocol - uri_str;
    uri->protocol = (char *) palloc0(protocol_len + 1);
    strncpy(uri->protocol, uri_str, protocol_len);

    /* make sure there is more to the uri string */
    if (strlen(uri_str) <= protocol_len)
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("invalid myprot URI \'%s\' : missing path",
                        uri_str),
                 errOmitLocation(true)));

    /* parse path */
    uri->path = pstrdup(uri_str + protocol_len + strlen("://"));

    return uri;
}

static
void FreeDemoUri(DemoUri *uri)
{
    if (uri->path)
        pfree(uri->path);
    if (uri->protocol)
        pfree(uri->protocol);
    pfree(uri);
}