A newer version of this documentation is available. Click here to view the most up-to-date release of the Greenplum 4.x documentation.
Runs a load job as defined in a YAML formatted control file.
gpload -f control_file [-l log_file] [-h hostname] [-p port] [-U username] [-d database] [-W] [--gpfdist_timeout seconds] [--no_auto_trans] [[-v | -V] [-q]] [-D] gpload -? gpload --version
The client machine where gpload is executed must have the following:
- Python 2.6.2 or later, pygresql (the Python
interface to PostgreSQL), and pyyaml. Note that Python and the
required Python libraries are included with the Greenplum Database server
installation, so if you have Greenplum Database installed on the machine where
gpload is running, you do not need a separate Python
installation.Note: Greenplum Loaders for Windows supports only Python 2.5 (available from www.python.org).
- The gpfdist parallel file distribution program installed and in your $PATH. This program is located in $GPHOME/bin of your Greenplum Database server installation.
- Network access to and from all hosts in your Greenplum Database array (master and segments).
- Network access to and from the hosts where the data to be loaded resides (ETL servers).
gpload is a data loading utility that acts as an interface to Greenplum Database's external table parallel loading feature. Using a load specification defined in a YAML formatted control file, gpload executes a load by invoking the Greenplum parallel file server (gpfdist), creating an external table definition based on the source data defined, and executing an INSERT, UPDATE or MERGE operation to load the source data into the target table in the database.
The operation, including any SQL commands specified in the SQL collection of the YAML control file (see Control File Format), are performed as a single transaction to prevent inconsistent data when performing multiple, simultaneous load operations on a target table.
- -f control_file
- Required. A YAML file that contains the load specification details. See Control File Format.
- --gpfdist_timeout seconds
- Sets the timeout for the gpfdist parallel file distribution program to send a response. Enter a value from 0 to 30 seconds (entering "0" to disables timeouts). Note that you might need to increase this value when operating on high-traffic networks.
- -l log_file
- Specifies where to write the log file. Defaults to ~/gpAdminLogs/gpload_YYYYMMDD. For more information about the log file, see Log File Format.
- Specify --no_auto_trans to disable processing the load operation as a single transaction if you are performing a single load operation on the target table.
- By default, gpload processes each load operation as a single transaction to prevent inconsistent data when performing multiple, simultaneous operations on a target table.
- -q (no screen output)
- Run in quiet mode. Command output is not displayed on the screen, but is still written to the log file.
- -D (debug mode)
- Check for error conditions, but do not execute the load.
- -v (verbose mode)
- Show verbose output of the load steps as they are executed.
- -V (very verbose mode)
- Shows very verbose output.
- -? (show help)
- Show help, then exit.
- Show the version of this utility, then exit.
- -d database
- The database to load into. If not specified, reads from the load control file, the environment variable $PGDATABASE or defaults to the current system user name.
- -h hostname
- Specifies the host name of the machine on which the Greenplum master database server is running. If not specified, reads from the load control file, the environment variable $PGHOST or defaults to localhost.
- -p port
- Specifies the TCP port on which the Greenplum master database server is listening for connections. If not specified, reads from the load control file, the environment variable $PGPORT or defaults to 5432.
- -U username
- The database role name to connect as. If not specified, reads from the load control file, the environment variable $PGUSER or defaults to the current system user name.
- -W (force password prompt)
- Force a password prompt. If not specified, reads the password from the environment variable $PGPASSWORD or from a password file specified by $PGPASSFILE or in ~/.pgpass. If these are not set, then gpload will prompt for a password even if -W is not supplied.
Control File Format
The gpload control file uses the YAML 1.1 document format and then implements its own schema for defining the various steps of a Greenplum Database load operation. The control file must be a valid YAML document.
The gpload program processes the control file document in order and uses indentation (spaces) to determine the document hierarchy and the relationships of the sections to one another. The use of white space is significant. White space should not be used simply for formatting purposes, and tabs should not be used at all.
The basic structure of a load control file is:
--- VERSION: 220.127.116.11 DATABASE: db_name USER: db_username HOST: master_hostname PORT: master_port GPLOAD: INPUT: - SOURCE: LOCAL_HOSTNAME: - hostname_or_ip PORT: http_port | PORT_RANGE: [start_port_range, end_port_range] FILE: - /path/to/input_file SSL: true | false CERTIFICATES_PATH: /path/to/certificates - COLUMNS: - field_name: data_type - TRANSFORM: 'transformation' - TRANSFORM_CONFIG: 'configuration-file-path' - MAX_LINE_LENGTH: integer - FORMAT: text | csv - DELIMITER: 'delimiter_character' - ESCAPE: 'escape_character' | 'OFF' - NULL_AS: 'null_string' - FORCE_NOT_NULL: true | false - QUOTE: 'csv_quote_character' - HEADER: true | false - ENCODING: database_encoding - ERROR_LIMIT: integer - ERROR_TABLE: schema.table_name EXTERNAL: - SCHEMA: schema | '%' OUTPUT: - TABLE: schema.table_name - MODE: insert | update | merge - MATCH_COLUMNS: - target_column_name - UPDATE_COLUMNS: - target_column_name - UPDATE_CONDITION: 'boolean_condition' - MAPPING: target_column_name: source_column_name | 'expression' PRELOAD: - TRUNCATE: true | false - REUSE_TABLES: true | false SQL: - BEFORE: "sql_command" - AFTER: "sql_command"
- Optional. The version of the gpload control file schema. The current version is 18.104.22.168.
- Optional. Specifies which database in Greenplum to connect to. If not specified, defaults to $PGDATABASE if set or the current system user name. You can also specify the database on the command line using the -d option.
- Optional. Specifies which database role to use to connect. If not specified, defaults to the current user or $PGUSER if set. You can also specify the database role on the command line using the -U option.
- If the user running gpload is not a Greenplum superuser, then the server configuration parameter gp_external_grant_privileges must be set to on in order for the load to be processed. See the Greenplum Database Reference Guide for more information.
- Optional. Specifies Greenplum master host name. If not specified, defaults to localhost or $PGHOST if set. You can also specify the master host name on the command line using the -h option.
- Optional. Specifies Greenplum master port. If not specified, defaults to 5432 or $PGPORT if set. You can also specify the master port on the command line using the -p option.
- Required. Begins the load specification section. A GPLOAD
specification must have an INPUT and an
OUTPUT section defined.
- Required. Defines the location and
the format of the input data to be loaded.
gpload will start one or more instances of
the gpfdist file distribution program on
the current host and create the required external table
definition(s) in Greenplum Database that point to the source
data. Note that the host from which you run
gpload must be accessible over the network
by all Greenplum hosts (master and segments).
- Required. The
SOURCE block of an
INPUT specification defines the
location of a source file. An INPUT
section can have more than one
SOURCE block defined. Each
SOURCE block defined corresponds
to one instance of the gpfdist file distribution program
that will be started on the local machine. Each
SOURCE block defined must have a
FILE specification. For more information about using the gpfdist parallel file server and single and multiple gpfdist instances, see "Loading and Unloading Data" in the Greenplum Database Administrator Guide.
- Optional. Specifies the specific port number
that the gpfdist file
distribution program should use. You can also
supply a PORT_RANGE to select an
available port from the specified range. If both
PORT_RANGE are defined, then
PORT takes precedence. If neither
PORT_RANGE are defined, the
default is to select an available port between
8000 and 9000.
- Required. Specifies the location of a file,
named pipe, or directory location on the local
file system that contains data to be loaded. You
can declare more than one file so long as the data
is of the same format in all files specified.
- Required when SSL is true;
cannot be specified when SSL is
false or unspecified. The
location specified in
CERTIFICATES_PATH must contain
the following files:
- The server certificate file, server.crt
- The server private key file, server.key
- The trusted certificate authorities, root.crt
The root directory (/) cannot be specified as CERTIFICATES_PATH.
- Optional. Specifies
the schema of the source data file(s) in the format
The DELIMITER character in the
source file is what separates two data value fields
(columns). A row is determined by a line feed
- Optional. Enables single row error isolation mode for this load operation. When enabled, input rows that have format errors will be discarded provided that the error limit count is not reached on any Greenplum segment instance during input processing. If the error limit is not reached, all good rows will be loaded and any error rows will either be discarded or logged to the table specified in ERROR_TABLE. The default is to abort the load operation on the first error encountered. Note that single row error isolation only applies to data rows with format errors; for example, extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences. Constraint errors, such as primary key violations, will still cause the load operation to abort if encountered. For information about handling load errors, see "Loading and Unloading Data" in the Greenplum Database Administrator Guide.
- Optional when ERROR_LIMIT is declared. Specifies an error table where rows with formatting errors will be logged when running in single row error isolation mode. You can then examine this error table to see error rows that were not loaded (if any). If the ERROR_TABLE specified already exists, it will be used. If it does not exist, it will be automatically generated. For more information about handling load errors, see "Loading and Unloading Data" in the Greenplum Database Administrator Guide.
- Optional. Defines the schema of the external table database objects created by gpload.
- The default is to use the Greenplum Database search_path.
- Required. Defines the target table and final
data column values that are to be loaded into the
- Required. The name of the target table to load into.
- Optional. Defaults to INSERT if not specified. There are three available load modes:
INSERT - Loads data into the target
table using the following method:
INSERT INTO target_table SELECT * FROM input_data;
- target_column_name: source_column_name
- target_column_name: 'expression'
- Optional. Specifies operations to run prior to the load operation. Right now
the only preload operation is TRUNCATE.
- Optional. If set to true, gpload will remove all rows in the target table prior to loading it.
- Optional. If set to true, gpload will not drop the external table objects and staging table objects it creates. These objects will be reused for future load operations that use the same load specifications. This improves performance of trickle loads (ongoing small loads to the same target table).
- Optional. Defines SQL commands to run before and/or after the load
operation. You can specify multiple BEFORE and/or
AFTER commands. List commands in the order of desired
- Optional. An SQL command to run before the load operation starts. Enclose commands in quotes.
- Optional. An SQL command to run after the load operation completes. Enclose commands in quotes.
Log File Format
Log files output by gpload have the following format:
Where timestamp takes the form: YYYY-MM-DD HH:MM:SS, level is one of DEBUG, LOG, INFO, ERROR, and message is a normal text message.
Some INFO messages that may be of interest in the log files are (where # corresponds to the actual number of seconds, units of data, or failed rows):
INFO|running time: #.## seconds INFO|transferred #.# kB of #.# kB. INFO|gpload succeeded INFO|gpload succeeded with warnings INFO|gpload failed INFO|1 bad row INFO|# bad rows
If your database object names were created using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the gpload control file. For example, if you create a table as follows:
CREATE TABLE "MyTable" ("MyColumn" text);
Your YAML-formatted gpload control file would refer to the above table and column names as follows:
- COLUMNS: - '"MyColumn"': text OUTPUT: - TABLE: public.'"MyTable"'
Run a load job as defined in my_load.yml:
gpload -f my_load.yml
Example load control file:
--- VERSION: 22.214.171.124 DATABASE: ops USER: gpadmin HOST: mdw-1 PORT: 5432 GPLOAD: INPUT: - SOURCE: LOCAL_HOSTNAME: - etl1-1 - etl1-2 - etl1-3 - etl1-4 PORT: 8081 FILE: - /var/load/data/* - COLUMNS: - name: text - amount: float4 - category: text - desc: text - date: date - FORMAT: text - DELIMITER: '|' - ERROR_LIMIT: 25 - ERROR_TABLE: payables.err_expenses OUTPUT: - TABLE: payables.expenses - MODE: INSERT SQL: - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)" - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
gpfdist, CREATE EXTERNAL TABLE in the Greenplum Database Reference Guide