gpkafka load

Load data from Kafka into Greenplum Database.

Synopsis

gpkafka load jobconfig.yaml
    [--name job_name]
    [-f | --force] [--quit-at-eof] [--partition]
    [{--force-reset-earliest | --force-reset-latest | --force-reset-timestamp tstamp}]
    [--config gpfdistconfig.json]
    [--gpfdist-host hostaddr] [--gpfdist-port portnum]
    [--debug-port portnum]
    [-l | --log-dir directory] [--verbose]
gpkafka load {-h | --help} 

Description

Note: gpkafka load is a wrapper around the Greenplum Streaming Server (GPSS) gpss and gpsscli utilities. Starting in Greenplum Streaming Server version 1.3.2, gpkafka load no longer launches a gpss server instance, but rather calls the backend server code directly.

When you run gpkafka load, the command submits, starts, and stops a GPSS job on your behalf.

Pivotal recommends that you migrate to using the GPSS utilities directly.

The gpkafka load utility loads data from a Kafka topic into a Greenplum Database table. When you run the command, you provide a YAML-formatted configuration file that defines load parameters such as the Greenplum Database connection options, the Kafka broker and topic, and the target Greenplum Database table.

gpkafka load uses the gpfdist or gpfdists protocol to load data into Greenplum. You can configure the protocol options by providing a JSON-formatted GPSS configuration file via the --config gpfdistconfig.json option to the command, or by specifying the --gpfdist-host hostaddr and/or --gpfdist-port portnum options.
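As an illustration of the --config option, the following sketch writes a minimal gpfdistconfig.json to the current directory. The host name and port number are hypothetical, and the nesting of Host and Port under a Gpfdist block is inferred from the Gpfdist:Host and Gpfdist:Port properties named on this page; refer to gpss.json for the authoritative format.

```shell
# Write a minimal GPSS configuration file for gpkafka load.
# Only the Gpfdist block is included because gpkafka load ignores the
# ListenAddress block. Host and Port values are hypothetical placeholders.
cat > gpfdistconfig.json <<'EOF'
{
    "Gpfdist": {
        "Host": "etl1.example.com",
        "Port": 8319
    }
}
EOF
```

With this file in place, running gpkafka load --config gpfdistconfig.json loadcfg.yaml would be expected to use these settings, and adding --gpfdist-port 8320 to the same command would override the port from the file.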

By default, gpkafka load loads all Kafka messages published to the topic, and then waits indefinitely for new messages to load. When you provide the --quit-at-eof option to the command, the utility exits after it reads all published messages and writes the data to Greenplum Database.
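For scheduled batch loads, the --quit-at-eof behavior can be wrapped in a small shell function. The following is a minimal sketch and not part of gpkafka: load_once is a hypothetical helper name, and the sketch assumes that gpkafka load returns a nonzero exit status on failure, which this page does not state.

```shell
# Hypothetical helper: run a single pass over the topic and report the outcome.
#   $1 = path to the job configuration file
#   $2 = directory for client command log files
load_once() {
    if gpkafka load --quit-at-eof --log-dir "$2" "$1"; then
        echo "load finished: $1"
    else
        # In the else branch, $? still holds the exit status of gpkafka load.
        status=$?
        echo "load failed with status $status: $1" >&2
        return "$status"
    fi
}
```

For example, load_once loadcfg.yaml /home/gpadmin/kafka_logs would read all messages currently published to the topic, write the client logs under the given directory, and then return.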

If you provide the --debug-port option, gpkafka load displays debug information to stdout during the load operation and starts a debug server from which you can obtain additional debug information.
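The debug server can be queried from a second terminal while the load is running. The sketch below assumes that gpkafka load was started with --debug-port 6060 on the local host; the host, the port, and the fetch_debug_index helper name are all hypothetical, and only the /debug/pprof/ path comes from this page.

```shell
# Hypothetical values: the host running gpkafka load and the port
# that was passed to --debug-port.
debug_host=localhost
debug_port=6060

# Fetch the index page served by the debug server.
# --silent hides the progress meter; --fail makes curl return a nonzero
# status on an HTTP error instead of printing the error page.
fetch_debug_index() {
    curl --silent --fail "http://${debug_host}:${debug_port}/debug/pprof/"
}
```

Calling fetch_debug_index while the load operation is active should return the debug information listing; the call fails if no debug server is listening on that port.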

If a load operation is interrupted by the user or exits, a subsequent gpkafka load that specifies the same Kafka topic and the same Greenplum Database name, target schema, and table resumes from the last recorded offset. If gpkafka load detects an offset mismatch, it returns an error. You can then re-run the command and choose to resume the load from the earliest available offset for the topic, to load only new messages published to the topic, or to load messages published since a specific time.
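Loading messages published since a specific time requires a timestamp expressed as epoch time in milliseconds. The following sketch computes the value for one hour before the current time and prints the resulting command for review. The one-hour window and the loadcfg.yaml file name are assumptions chosen for illustration, and the timestamp is only valid if it is not earlier than the earliest message in the topic.

```shell
# Current time as seconds since the epoch.
now_s=$(date +%s)

# One hour earlier, converted from seconds to milliseconds.
tstamp=$(( (now_s - 3600) * 1000 ))

# Print the command for review rather than running it directly.
echo "gpkafka load --force-reset-timestamp ${tstamp} loadcfg.yaml"
```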

Options

jobconfig.yaml
The Version 1 or Version 2 YAML-formatted configuration file that defines the load operation parameters. If the filename provided is not an absolute path, Greenplum Database assumes the file system location is relative to the current working directory. Refer to gpkafka.yaml and gpkafka-v2.yaml for the format and content of the parameters that you specify in Versions 1 and 2 of this file.
--name job_name
Use job_name to identify the job. If you do not provide a name, the command assigns a unique identifier to the job.
-f | --force
Force gpkafka to reload the configuration of a running job. gpkafka stops the job, updates the job with the configuration specified in jobconfig.yaml, and then restarts the job. If you previously named the job, you must provide --name job_name when you force a job configuration reload with this option.
Note: Do not attempt to update a configuration property that gpkafka uses to uniquely identify a Kafka job (the Kafka topic name and the Greenplum database, schema, and table names). If you change any such configuration property, gpkafka creates a new internal job and loads all available messages.
--quit-at-eof
When you specify this option, gpkafka load exits after it reads all of the Kafka messages published to the topic. The default behaviour of gpkafka load is to wait indefinitely for, and then consume, new Kafka messages published to the topic.
gpkafka load ignores job retry SCHEDULE configuration settings when it is invoked with the --quit-at-eof flag.
--partition
By default, gpkafka load outputs the job progress by batch, and displays the start and end times, the message number and size, the number of inserted and rejected rows, and the transfer speed per batch. When you specify the --partition option, gpkafka load outputs the job progress by partition, and displays the partition identifier, the start and end times, the beginning and ending offsets, the message size, and the transfer speed per partition.
--force-reset-earliest
gpkafka load returns an error if its recorded offset does not match the Kafka message offset for the topic. Re-run gpkafka load and specify the --force-reset-earliest option to resume the load operation from the earliest available message published to the Kafka topic.
--force-reset-latest
gpkafka load returns an error if its recorded offset does not match the Kafka message offset for the topic. Re-run gpkafka load and specify the --force-reset-latest option to load only new data messages published to the Kafka topic.
--force-reset-timestamp tstamp
Specify the --force-reset-timestamp option to load Kafka messages published to the topic from the offset associated with the specified time. tstamp must specify epoch time in milliseconds, and is bounded by the earliest message time and the current time.
--config gpfdistconfig.json
The GPSS configuration file. This file includes properties that configure the gpfdist/s protocol used for the load request. Refer to gpss.json for detailed information about the format of this file and the configuration properties supported.
Note: gpkafka load reads the configuration specified in the Gpfdist protocol block of the gpfdistconfig.json file; it ignores the GPSS configuration specified in the ListenAddress block of the file.
--gpfdist-host hostaddr
The gpfdist service host name or IP address that GPSS sets in the external table LOCATION clause. If specified, overrides a Gpfdist:Host value provided in gpfdistconfig.json.
--gpfdist-port portnum
The gpfdist service port number. If specified, overrides a Gpfdist:Port value provided in gpfdistconfig.json.
--debug-port portnum
When you specify this option, gpkafka load includes debug information such as the source code file and line number in messages it writes to stdout. The utility also starts a debug server at the port identified by portnum; additional debug information including the call stack and performance statistics is available via curl http://gpkafkahost:portnum/debug/pprof/.
-l | --log-dir directory
Specify the directory to which gpkafka writes client command log files. gpkafka must have write permission to the directory. gpkafka creates the log directory if it does not exist.
If you do not provide this option, gpkafka writes client log files to the $HOME/gpAdminLogs directory.
--verbose
The default behaviour of the command utility is to display information and error messages to stdout. When you specify the --verbose option, gpkafka also outputs debug-level messages about the operation.
-h | --help
Show command utility help, and then exit.

Examples

Stream Kafka data into Greenplum Database using the load parameters defined in a configuration file named loadcfg.yaml located in the current directory:

gpkafka load loadcfg.yaml

Load Kafka data into Greenplum Database using a configuration file located in the current directory named loadcfg.yaml; exit the load operation after reading all Kafka messages published to the topic:

gpkafka load --quit-at-eof loadcfg.yaml
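
The --name and --force options can be combined to start a named job and later reload its configuration. The following sketch wraps both invocations in shell functions; the job name orders_load and the function names start_job and reload_job are hypothetical, and loadcfg.yaml is assumed to be in the current directory:

```shell
# Hypothetical job name used for illustration.
job=orders_load

# Start a load job under an explicit name.
start_job() {
    gpkafka load --name "$job" loadcfg.yaml
}

# After editing loadcfg.yaml, force the named job to pick up the new
# configuration. The same --name must be supplied for a named job.
reload_job() {
    gpkafka load --name "$job" --force loadcfg.yaml
}
```

Because changing the Kafka topic name or the Greenplum database, schema, or table name causes gpkafka to create a new internal job, reload_job is only appropriate after changes to other configuration properties.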