gpkafka load

Load data from Kafka into Greenplum Database.

Synopsis

gpkafka load [--quit-at-eof] [[--force-reset-earliest] | [--force-reset-latest]] [--debug-port portnum] [-v | --verbose] config.yaml
gpkafka load {-h | --help} 

Description

The gpkafka load utility loads data from a Kafka topic into a Greenplum Database table. When you run the command, you provide a YAML-formatted configuration file that defines load parameters such as the Greenplum Database connection options, the Kafka broker and topic, and the target Greenplum Database table.

By default, gpkafka load loads all Kafka messages published to the topic, and then waits indefinitely for new messages to load. When you provide the --quit-at-eof option to the command, the utility exits after it reads all published messages and writes the data to Greenplum Database.

If you provide the --debug-port option, gpkafka load displays debug information to stdout during the load operation and starts a debug server from which you can obtain additional debug information.
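As a rough sketch of how the debug server might be used, the helper below starts a load with --debug-port in the background and then queries the server with curl. The function name debug_load, the host localhost, the two-second wait, and the example values 9999 and loadcfg.yaml are illustrative assumptions, not part of gpkafka; the content of the debug output is not shown here.

```shell
# Hypothetical helper: start a load with the debug server enabled, then query it.
# Assumes gpkafka and curl are on PATH and that the chosen port is free.
debug_load() {
    port="$1"      # debug server port, for example 9999
    config="$2"    # load configuration file, for example loadcfg.yaml

    # Run the load in the background so that the shell stays available.
    gpkafka load --debug-port "$port" "$config" &

    # Give the debug server a moment to start (arbitrary delay), then
    # fetch the additional debug information over HTTP.
    sleep 2
    curl "http://localhost:${port}"
}
```

For example, debug_load 9999 loadcfg.yaml would run gpkafka load --debug-port 9999 loadcfg.yaml and then request http://localhost:9999. The load keeps running in the background until you stop it.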

If you interrupt gpkafka load or the utility exits, a subsequent load operation that specifies the same Kafka topic and Greenplum Database table names resumes from the last recorded offset. If gpkafka load detects an offset mismatch, it returns an error. You can then choose to resume the load operation from the earliest available offset for the topic (--force-reset-earliest), or to load only new messages published to the topic (--force-reset-latest).
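The two recovery choices after an offset mismatch can be sketched as a small shell helper. The function name resume_load and its earliest/latest argument are hypothetical conveniences; only the --force-reset-earliest and --force-reset-latest options come from gpkafka load itself.

```shell
# Hypothetical helper: rerun a load after gpkafka load reports an offset mismatch.
# Usage: resume_load {earliest|latest} config.yaml
resume_load() {
    case "$1" in
        earliest)
            # Resume from the earliest message still available in the topic.
            gpkafka load --force-reset-earliest "$2"
            ;;
        latest)
            # Skip to the end of the topic and load only new messages.
            gpkafka load --force-reset-latest "$2"
            ;;
        *)
            echo "usage: resume_load {earliest|latest} config.yaml" >&2
            return 2
            ;;
    esac
}
```

For example, resume_load earliest loadcfg.yaml is equivalent to running gpkafka load --force-reset-earliest loadcfg.yaml directly. The two options are mutually exclusive, which is why the helper takes a single mode argument.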

Options

config.yaml
The Version 1 or Version 2 YAML-formatted configuration file that defines the load operation parameters. If the filename provided is not an absolute path, gpkafka load assumes that the file system location is relative to the current working directory. Refer to gpkafka.yaml and gpkafka-v2.yaml for the format and content of the parameters that you specify in Versions 1 and 2 of this file.
--quit-at-eof
When you specify this option, gpkafka load exits after it reads all of the Kafka messages published to the topic. The default behaviour of gpkafka load is to wait indefinitely for, and then consume, new Kafka messages published to the topic.
--force-reset-earliest
gpkafka load returns an error if its recorded offset is behind the earliest Kafka message offset currently available for the topic. Specify the --force-reset-earliest option to resume the load operation from the earliest available message published to the topic.
--force-reset-latest
gpkafka load returns an error if its recorded offset is behind the earliest Kafka message offset currently available for the topic. Specify the --force-reset-latest option to load only new messages published to the Kafka topic.
--debug-port portnum
When you specify this option, gpkafka load includes debug information such as source code file and line number in messages it writes to stdout. The utility also starts a debug server at the port identified by portnum; additional debug information including the call stack and performance statistics is available via curl http://gpkafkahost:portnum.
-v | --verbose
The default behaviour of gpkafka load is to display information and error messages to stdout. When you specify the --verbose option, gpkafka load also outputs additional details about the load operation, including the Kafka partition, batch, and offset status.
-h | --help
Show command help, and then exit.

Examples

Stream Kafka data into Greenplum Database using the load parameters defined in a configuration file named loadcfg.yaml located in the current directory:

gpkafka load loadcfg.yaml

Load Kafka data into Greenplum Database using a configuration file named loadcfg.yaml located in the current directory; exit the load operation after reading all Kafka messages published to the topic:

gpkafka load --quit-at-eof loadcfg.yaml
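
The --quit-at-eof and --verbose options can be combined for a one-shot load whose output is kept for later inspection. The following is a minimal sketch: the function name batch_load and the log file argument are hypothetical, and tee is the standard utility that copies its input to both the terminal and a file.

```shell
# Hypothetical helper: run a one-shot verbose load and keep a copy of its output.
# Usage: batch_load config.yaml logfile
batch_load() {
    config="$1"   # load configuration file, for example loadcfg.yaml
    log="$2"      # file that receives a copy of the output

    # --quit-at-eof exits after all published messages are read;
    # --verbose adds Kafka partition, batch, and offset status to the output.
    # Note: the pipeline's exit status is that of tee, not of gpkafka load.
    gpkafka load --quit-at-eof --verbose "$config" 2>&1 | tee "$log"
}
```

For example, batch_load loadcfg.yaml load.log runs gpkafka load --quit-at-eof --verbose loadcfg.yaml and writes a copy of everything it prints to load.log.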