Running Greenplum Loader

Running Greenplum Loader

Before You Begin

Before you can run Greenplum loader:

  1. Make sure you have installed and configured Python and the Greenplum loader programs. See Installing Greenplum Loader.
  2. Make sure that you have network access to and from all hosts in your Greenplum Database array (master and segments), and to and from the hosts where the data to be loaded resides (if not on the local machine).
  3. Make sure that the ports you declare in your load control file are unblocked by your Windows firewall.
  4. Make sure your Greenplum Database system is up and running and that you know all of the connection information (host name, port, role name, database name, etc.).
  5. Create your database, schema, and table structures in Greenplum Database prior to loading data.
  6. Prepare your data so that it is in a format acceptable by Greenplum loader. See Formatting the Input Data.
  7. Write your control file. The control file specifies the source data, load rules, and target table for loading the data. See Creating the Load Control File.

Creating the Load Control File

Before you can run Greenplum loader (gpload.py), you must create a text file that contains the load specification information. This file must be in valid YAML 1.1 document format and use the Greenplum schema for defining the various steps of a load operation. See gpload.py for details on the correct control file format and schema.

Here is an example of a load control file:

---
VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: mdw-1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl1-1
           - etl1-2
           - etl1-3
           - etl1-4
         PORT: 8081
         FILE: 
           - /var/load/data/*
    - COLUMNS:
           - name: text
           - amount: float4
           - category: text
           - desc: text
           - date: date
    - FORMAT: text
    - DELIMITER: '|'
    - ERROR_LIMIT: 25
    - ERROR_TABLE: payables.err_expenses
   OUTPUT:
    - TABLE: payables.expenses
    - MODE: INSERT
   SQL:
   - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
   - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"

Formatting the Input Data

When you use Greenplum loader, you need to specify how your data is formatted. Data can be in either delimited text (TEXT) or comma separated values (CSV) format. External data must be formatted correctly in order to be read by Greenplum Database. This section explains the format of data files expected by Greenplum Database.

Formatting Rows

Greenplum Database expects rows of data to be separated by the LF character (Line feed, 0x0A), CR (Carriage return, 0x0D), or CR followed by LF (CR+LF, 0x0D 0x0A). LF is the standard newline representation on UNIX or UNIX-like operating systems. Other operating systems (such as Windows or Mac OS 9) may use CR individually, or CR+LF. All of these representations of a newline are supported by Greenplum Database as a row delimiter.

Formatting Columns

The default column or field delimiter is the horizontal TAB character (0x09) for text files and the comma character (0x2C) for CSV files. However, it is possible to declare another single character delimiter using the DELIMITER attribute in the load configuration file. The delimiter character must only appear between any two data value fields. Do not place a delimiter at the beginning or end of a row. For example, if using the pipe character ( | ) as your delimiter:

data value 1 | data value 2 | data value 3

Representing NULL Values

NULL is the value used to represent an unknown piece of data in a column or field. Within your data files you can designate a string to represent null values. The default string is \N (backslash-N) in TEXT mode, or an empty value with no quotations in CSV mode. You can also declare a different string using the NULL attribute in the load configuration file. For example, you might prefer an empty string for cases where you do not want to distinguish nulls from empty strings. When using the Greenplum Database loading tools, any data item that matches the designated null string will be considered a null value.

Escaping

The data file has two reserved characters that have special meaning to Greenplum Database:

  • The designated delimiter character, which is used to separate columns or fields in the data file.
  • The newline character used to designate a new row in the data file.

If your data contains either of these characters, you must escape the character so Greenplum treats it as data and not as a field separator or new row. By default, the escape character is a backslash (\) for text-formatted files and a double quote (") for csv-formatted files.

Escaping in Text Formatted Files

By default, the escape character is a backslash (\) for text-formatted files. If you want to use a different escape character, use the ESCAPE attribute in the load configuration file. In cases where your selected escape character is present in your data, you can use it to escape itself.

For example, suppose you have a table with three columns and you want to load the following three fields:

  • backslash = \
  • vertical bar = |
  • exclamation point = !

Your designated delimiter character is pipe (|), and your designated escape character is backslash (\). The formatted row in your data file would look like this:

backslash = \\ | vertical bar = \| | exclamation point = !

Notice how the backslash character that is part of the data is escaped with another backslash character, and the pipe character that is part of the data is escaped with a backslash character.

The escape character can also be used to escape octal and hexidecimal sequences. When used in this way, the escaped value will get converted to the equivalent character when loaded into Greenplum Database. For example, to load the ampersand character (&), you could use the escape character to escape its equivalent hexidecimal (\0x26) or octal (\046) representation.

If there is no need to escape the data in text-formatted files, you can disable escaping using the ESCAPE clause of the COPY and CREATE EXTERNALTABLE commands or the ESCAPE attribute of the load control file for gpload.py as follows:

ESCAPE 'OFF'

This is useful for input data that contains a lot of backslash characters within the data itself (such as web log data).

Escaping in CSV Formatted Files

By default, the escape character is a double quote (") for CSV-formatted files. If you want to use a different escape character, use the ESCAPE clause of COPY and CREATE EXTERNALTABLE commands or the ESCAPE attribute of the load control file for gpload.py to declare a different escape character. In cases where your selected escape character is present in your data, you can use it to escape itself.

For example, suppose you have a table with three columns and you want to load the following three fields:

  • Free trip to A,B
  • 5.89
  • Special rate "1.79"

Your designated delimiter character is comma (,), and your designated escape character is double quote ("). The formatted row in your data file would look like this:

"Free trip to A,B" ,"5.89" ,"Special rate ""1.79"""

Notice how that for the comma character that is part of the data, the entire data value is enclosed in double quotes. Also notice how the double quotes that are part of the data are also escaped with a double quote even though the field value is enclosed in double quotes.

Embedding the entire field inside a set of double quotes also guarantees preservation of leading and trailing whitespace characters:
"Free trip to A,B " ,"5.89 " ,"Special rate ""1.79"" "

Character Encoding

A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of numbers or octets, in order to facilitate the transmission and storage of data. The character set support in Greenplum Database allows you to store text in a variety of character sets, including single-byte character sets such as the ISO 8859 series and multiple-byte character sets such as EUC (Extended UNIX Code), UTF-8, and Mule internal code. All supported character sets can be used transparently by clients, but a few are not supported for use within the server (that is, as a server-side encoding).

Data files must be in a character encoding recognized by Greenplum Database. See the Greenplum Database Reference Guide for the supported character sets. Data files that contain invalid or unsupported encoding sequences will encounter errors when loading into Greenplum Database.

Note: On data files generated on a Microsoft Windows operating system, try running the dos2unix system command to remove any Windows-only characters prior to loading into Greenplum Database.

Running Greenplum Loader

Greenplum loader is invoked by running the gpload.py program from a Windows command-line session. For complete command syntax and options, see gpload.py.

Greenplum Loader Log Files

By default, gpload.py creates a directory called gpAdminLogs in the same location from where you execute the program and writes its log files there. Alternatively, you can use the -l option when executing gpload.py to direct the log output to a different location. See gpload.py for the format of these log files.

Updating Database Statistics After Data Loads

After loading data, always run the ANALYZE SQL command to update the database statistics used by the query planner. ANALYZE collects statistics about the contents of tables in the database, and stores the results in the system table pg_statistic. The query planner uses these statistics to help determine the most efficient execution plans for queries. For example, to collect statistics on a newly loaded table, run the following on the Greenplum master host:

psql dbname -c 'ANALYZE mytable;'

Vacumming the Database After Load Errors

The Greenplum loader will stop a load operation if it encounters an error. When this happens, the target table may already have received earlier rows in the load operation. Although these rows will not be visible or accessible, they still occupy disk space. This may amount to a considerable amount of wasted disk space if the failure happened well into a large load operation. You may wish to invoke the VACUUM command to recover the wasted space. For example, run the following command on the master host after a load error:

vacummdb dbname [table_name]

VACUUM reclaims storage occupied by deleted tuples. In normal operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done. Therefore it's recommended to do VACUUM periodically, especially on frequently-updated tables.