Overview of the Greenplum-Kafka Integration

Pivotal Greenplum Database is a massively parallel processing database server specially designed to manage large scale analytic data warehouses and business intelligence workloads. Apache Kafka is a fault-tolerant, low-latency, distributed publish-subscribe message system. The Greenplum-Kafka Integration uses the Greenplum Streaming Server to provide high speed, parallel data transfer from Apache Kafka to Greenplum Database to support a streaming ETL pipeline.

The Greenplum-Kafka Integration supports the Apache and Confluent Kafka distributions. Refer to the Apache Kafka Documentation for more information about Apache Kafka.

A Kafka message may include a key and a value. Kafka stores streams of messages (or records) in categories called topics. A Kafka producer publishes records to partitions in one or more topics. A Kafka consumer subscribes to a topic and receives records in the order that they were sent within a given Kafka partition. Kafka does not guarantee the order of data originating from different Kafka partitions.
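The ordering guarantee above — per-partition order preserved, cross-partition order not — can be illustrated with a small simulation. This is a sketch of the partitioning concept only, not the Kafka client API; the record keys, values, and two-partition setup are invented for illustration:

```python
# Illustrative sketch: messages with the same key hash to the same partition,
# so per-key (per-partition) order is preserved, even though a consumer may
# observe records from *different* partitions in any interleaving.
records = [("cust-1", "open"), ("cust-2", "open"), ("cust-1", "update"),
           ("cust-2", "close"), ("cust-1", "close")]

NUM_PARTITIONS = 2
partitions = {p: [] for p in range(NUM_PARTITIONS)}

# Simulate a key-hash partitioner: records with equal keys land in one partition,
# and each partition's log records them in production order.
for key, value in records:
    partitions[hash(key) % NUM_PARTITIONS].append((key, value))

for p, msgs in partitions.items():
    print(p, msgs)
```

Within any one partition the values for a given key appear exactly in the order they were produced; no such relationship holds between records in different partitions.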

The Greenplum-Kafka Integration includes the gpkafka utility.

Note: gpkafka is a wrapper around the Greenplum Streaming Server (GPSS) gpss and gpsscli commands, and its subcommands are supported for backwards compatibility. Pivotal recommends that you migrate to using the GPSS utilities directly.

The gpkafka utility is a Kafka consumer. It ingests streaming data from a single Kafka topic, using Greenplum Database readable external tables to transform and insert or update the data into a target Greenplum table. You identify the Kafka source, the data format, and the Greenplum Database connection options and target table definition in a YAML-formatted load configuration file that you provide to the utility. If a load operation is interrupted by user action or error, a subsequent gpkafka invocation that specifies the same Kafka topic and target Greenplum Database table resumes loading from the last recorded offset.
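As a sketch of such a load configuration file, the fragment below shows the general shape of a version 1 gpkafka configuration. The host names, topic, column definitions, and table name are illustrative placeholders, not values from this document; consult the gpkafka load configuration reference for the authoritative layout:

```yaml
DATABASE: testdb            # target Greenplum database (placeholder)
USER: gpadmin
HOST: mdw
PORT: 5432
KAFKA:
   INPUT:
      SOURCE:
         BROKER: kbrokerhost1:9092   # Kafka broker host:port (placeholder)
         TOPIC: customer_expenses    # the single Kafka topic to ingest
      COLUMNS:                       # columns as they arrive from Kafka
         - NAME: cust_id
           TYPE: int
         - NAME: expenses
           TYPE: decimal(9,2)
      FORMAT: csv                    # format of the Kafka message values
   OUTPUT:
      TABLE: data_from_kafka         # target Greenplum table (placeholder)
   COMMIT:
      MAX_ROW: 1000                  # commit a batch after this many rows
```

You would then start the load by passing the file to the utility, for example with gpkafka load on this configuration file.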


The Greenplum-Kafka Integration requires Kafka version 0.11 or newer for exactly-once delivery assurance. You can run with an older version of Kafka (but lose the exactly-once guarantee) by adding the following PROPERTIES block to your gpkafka.yaml load configuration file:

      PROPERTIES:
        api.version.request: false
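As a sketch of where this block sits, the fragment below nests PROPERTIES under the KAFKA INPUT section alongside the SOURCE definition; the broker and topic values are placeholders, and the exact nesting should be confirmed against the gpkafka load configuration reference:

```yaml
KAFKA:
   INPUT:
      SOURCE:
         BROKER: kbrokerhost1:9092   # placeholder broker
         TOPIC: customer_expenses    # placeholder topic
      PROPERTIES:
         api.version.request: false  # allow pre-0.11 Kafka brokers
```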