gpsscli load

Load data with the Greenplum Streaming Server.

Synopsis

gpsscli load jobconfig.yaml [...]
     [--name job_name]
     [-f | --force] [--quit-at-eof] [--partition]
     [{--force-reset-earliest | --force-reset-latest | --force-reset-timestamp tstamp}]
     [-p | --property template_var=value]
     [--config gpsscliconfig.json]
     [--gpss-host host] [--gpss-port port]
     [--no-check-ca] [-l | --log-dir directory] [--verbose]
     [--color] [--csv-log]

gpsscli load {-h | --help}

Description

The gpsscli load command initiates a load job to a specific Greenplum Streaming Server (GPSS) instance. When you run gpsscli load, the command submits, starts, and displays the progress of a GPSS job.

You provide one or more YAML-formatted configuration files that define the job parameters when you run the command. When you specify a single load configuration file, you may choose a name to identify the job. If you do not provide a name, GPSS uses the base name of the load configuration file as the job identifier. For example, if you invoke this command with the load configuration file /dir/jobconfig.yaml and do not provide the --name option, GPSS assigns the job the identifier jobconfig.
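
As a rough illustration of the default naming rule, the following sketch derives the identifier that the rule implies for a given file path. It assumes that the base name is the file name with its directory and .yaml extension removed, as in the /dir/jobconfig.yaml example; GPSS performs the actual naming itself.

```shell
# Hypothetical path; replace with your own load configuration file.
config_file=/dir/jobconfig.yaml

# Strip the directory and the .yaml suffix to get the assumed default job name.
job_name=$(basename "$config_file" .yaml)

echo "$job_name"   # prints: jobconfig
```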

By default, gpsscli load loads all available data and then waits indefinitely for new messages to load. If you interrupt the command or it exits in this mode, the GPSS job remains in the Running state, and you must explicitly stop the job with gpsscli stop.

When you provide the --quit-at-eof option to the command, the utility exits after it reads all published data, writes the data to Greenplum Database, and stops the job. The GPSS job is in the Stopped state when the command returns.
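
The two modes described above might be invoked as in the following sketch. The run helper and the DRY_RUN variable are illustrative and not part of GPSS; by default the helper only prints each command so that you can review it. The sketch also assumes that gpsscli stop accepts the job name as its argument.

```shell
# Illustrative helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Default mode: the command keeps waiting for new messages until interrupted.
# The job stays in the Running state afterwards, so stop it explicitly.
run gpsscli load jobconfig.yaml
run gpsscli stop jobconfig

# One-shot mode: exit once all published data is loaded; the job ends Stopped.
run gpsscli load --quit-at-eof jobconfig.yaml
```

Set DRY_RUN=0 in the environment to have the helper run the commands instead of printing them.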

If gpsscli load detects an offset mismatch when loading from a Kafka data source, you can choose how to resume the load operation: from the earliest available data, from new data only, or from data emitted since a specific time.
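
The three recovery choices correspond to the --force-reset-earliest, --force-reset-latest, and --force-reset-timestamp options. The following sketch prints one candidate command for each choice; only one of them would be used in practice. The one-hour look-back is an arbitrary assumption, and the timestamp is computed as epoch time in milliseconds, which is the format the option expects.

```shell
# Epoch time in milliseconds for one hour ago (date +%s yields seconds).
# The value must not be earlier than the earliest message time in the topic.
tstamp=$(( ( $(date +%s) - 3600 ) * 1000 ))

# Candidate commands, printed for review rather than run:
echo "gpsscli load --force-reset-earliest jobconfig.yaml"
echo "gpsscli load --force-reset-latest jobconfig.yaml"
echo "gpsscli load --force-reset-timestamp $tstamp jobconfig.yaml"
```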

If the GPSS instance to which you want to send the request is not running on the default host (127.0.0.1) or the default port number (5000), you can specify the GPSS host and/or port via command line options.
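
A minimal sketch of targeting a non-default GPSS instance follows. GPSS_HOST and GPSS_PORT are hypothetical shell variables chosen for this example (gpsscli does not read them); when they are unset, the sketch falls back to the documented defaults.

```shell
# Hypothetical shell variables (not read by gpsscli itself); the fallbacks
# are the documented default host and port.
gpss_host=${GPSS_HOST:-127.0.0.1}
gpss_port=${GPSS_PORT:-5000}

# Assemble the command and print it for review.
load_cmd="gpsscli load --gpss-host $gpss_host --gpss-port $gpss_port jobconfig.yaml"
echo "$load_cmd"
```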

Options

jobconfig.yaml [...]
One or more YAML-formatted configuration files that define the parameters of the job. If a filename provided is not an absolute path, Greenplum Database assumes the file system location is relative to the current working directory.
Note: GPSS uses the properties in a YAML configuration file to uniquely identify a load operation. Submit a configuration file only once. If you submit the same configuration file more than once, GPSS creates the job, but the job eventually fails with an error.
--name job_name
Use job_name to identify the job. If you do not provide a name, the default job identifier is the base name of the load configuration file. Job names must be unique.
Note: GPSS does not support specifying a job_name when you provide more than one jobconfig.yaml load configuration file to the command.
-f | --force
Force GPSS to reload the configuration of a running job. GPSS stops the job, updates the job with the configuration specified in jobconfig.yaml, and then restarts the job. If you previously named the job, you must provide --name job_name when you force a job configuration reload with this option.
Note: Do not attempt to update a configuration property that GPSS uses to uniquely identify a job. If you change any such configuration property, GPSS creates a new internal job and loads all available messages.
--quit-at-eof
When you specify this option, gpsscli load exits after it reads all of the source data. The default behaviour of gpsscli load is to wait indefinitely for, and then consume, new data from the source.
gpsscli load ignores job retry SCHEDULE configuration settings when it is invoked with the --quit-at-eof flag.
--force-reset-earliest
gpsscli load returns an error if its recorded offset does not match that of the Kafka data source. Re-run gpsscli load and specify the --force-reset-earliest option to resume the load operation from the earliest available data offset known to the data source.
Note: gpsscli load supports this option only when loading from a Kafka data source.
Note: --force-reset-earliest specified on the command line takes precedence over a FALLBACK_OFFSET/fallback_offset set in the jobconfig.yaml.
--force-reset-latest
gpsscli load returns an error if its recorded offset does not match that of the Kafka data source. Re-run gpsscli load and specify the --force-reset-latest option to load only new data emitted from the data source.
Note: gpsscli load supports this option only when loading from a Kafka data source.
Note: --force-reset-latest specified on the command line takes precedence over a FALLBACK_OFFSET/fallback_offset set in the jobconfig.yaml.
--force-reset-timestamp tstamp
Specify the --force-reset-timestamp option to load Kafka messages published since the specified time. tstamp must specify epoch time in milliseconds, and is bounded by the earliest message time and the current time.
Note: gpsscli load supports this option only when loading from a Kafka data source.
--partition
By default, GPSS outputs the Kafka job progress by batch, and displays the start and end times, the message number and size, the number of inserted and rejected rows, and the transfer speed per batch. When you specify the --partition option, GPSS outputs the job progress by partition, and displays the partition identifier, the start and end times, the beginning and ending offsets, the message size, and the transfer speed per partition.
Note: gpsscli load supports this option only when loading from a Kafka data source.
-p | --property template_var=value
Substitute value for instances of the property value template {{template_var}} referenced in the jobconfig.yaml load configuration file.
--config gpsscliconfig.json
The GPSS configuration file. This file includes properties that identify the GPSS instance that services the command. When SSL encryption is enabled between the GPSS client and server, you also use this file to identify the file system location of the client SSL certificates. Refer to gpss.json for detailed information about the format of this file and the configuration properties supported.
Note: gpsscli subcommands read the configuration specified in the ListenAddress block of the gpsscliconfig.json file, and ignore the gpfdist configuration specified in the Gpfdist block of the file.
--color
Enable the use of color when displaying front-end log messages. When specified, GPSS colors the log level in messages that it writes to stdout. Color is disabled by default.
GPSS ignores the --color option if you also specify --csv-log.
--csv-log
Write front-end log messages in CSV format. By default, GPSS writes log messages to stdout using spaces between fields for a more human-readable format.
--gpss-host host
The GPSS host. The default host address is 127.0.0.1. If specified, this option overrides a ListenAddress:Host value provided in gpsscliconfig.json.
--gpss-port port
The GPSS port number. The default port number is 5000. If specified, this option overrides a ListenAddress:Port value provided in gpsscliconfig.json.
--no-check-ca
Disable certificate verification when SSL is enabled between the GPSS client and server. By default, GPSS checks the certificate authority (CA) each time that you invoke a gpsscli subcommand.
-l | --log-dir directory
The directory to which GPSS writes client command log files. GPSS must have write permission to the directory. GPSS creates the log directory if it does not exist.
If you do not provide this option, GPSS writes gpsscli client log files to the $HOME/gpAdminLogs directory.
--verbose
The default behaviour of the command utility is to display information and error messages to stdout. When you specify the --verbose option, GPSS also outputs debug-level messages about the operation.
-h | --help
Show command utility help, and then exit.
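
To illustrate the -p | --property option described above, the following sketch mimics the substitution with sed on a single hypothetical configuration line. The TOPIC key, the topic_name variable, and the value orders are assumptions made for the example; GPSS performs the real substitution when it reads jobconfig.yaml.

```shell
# A hypothetical line from jobconfig.yaml that references a template variable.
template_line='TOPIC: {{topic_name}}'

# Mimic the effect of: gpsscli load -p topic_name=orders jobconfig.yaml
resolved_line=$(printf '%s\n' "$template_line" | sed 's/{{topic_name}}/orders/')

echo "$resolved_line"   # prints: TOPIC: orders
```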

Examples

Submit a GPSS load job from Kafka named from_topic1 whose load parameters are defined by the configuration file named loadcfg.yaml:

$ gpsscli load --name from_topic1 loadcfg.yaml
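
As a follow-on sketch, after editing loadcfg.yaml the running job from the example above could be reloaded with --force. Because the job was named when it was submitted, --name is repeated here. The command is only assembled and printed so that it can be reviewed before it is run on a host where gpsscli is installed.

```shell
job_name=from_topic1
config_file=loadcfg.yaml

# --force stops the job, applies the updated configuration, and restarts it.
reload_cmd="gpsscli load --force --name $job_name $config_file"
echo "$reload_cmd"
```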