Overview of the Greenplum-Kafka Integration

Pivotal Greenplum Database is a massively parallel processing database server designed to manage large-scale analytic data warehouses and business intelligence workloads. Apache Kafka is a fault-tolerant, low-latency, distributed publish-subscribe messaging system. The Pivotal Greenplum-Kafka Integration uses the Pivotal Greenplum Stream Server to provide high-speed, parallel data transfer from Apache Kafka to Greenplum Database in support of a streaming ETL pipeline.

The Pivotal Greenplum-Kafka Integration supports the Apache and Confluent Kafka distributions. Refer to the Apache Kafka Documentation for more information about Apache Kafka.

A Kafka message may include a key and a value. Kafka stores streams of messages (or records) in categories called topics. A Kafka producer publishes records to partitions in one or more topics. A Kafka consumer subscribes to a topic and receives records in the order that they were sent within a given Kafka partition. Kafka does not guarantee the order of data originating from different Kafka partitions.
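
The ordering behavior described above can be illustrated with a toy model. This is a minimal sketch in plain Python, not the Kafka client API: the partition count, the CRC-based partitioner, and the function names partition_for and publish are assumptions made for illustration. It shows that records sharing a key land in one partition and keep their send order there, while nothing is implied about the order across partitions.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for the toy topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (a stand-in for a real partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(records, num_partitions: int = NUM_PARTITIONS):
    """Append (key, value) records to per-partition logs in the order they are sent."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


if __name__ == "__main__":
    sent = [("cust-1", "a"), ("cust-2", "b"), ("cust-1", "c"), ("cust-2", "d")]
    # Each partition log preserves send order; order across logs is undefined.
    for partition_id, log in sorted(publish(sent).items()):
        print(partition_id, log)
```

A consumer that needs related records in order would therefore rely on them sharing a partition, for example by sharing a key.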

The Greenplum-Kafka Integration includes the gpkafka utility.

Note: Starting in Greenplum Database version 5.16, gpkafka is a wrapper around the Greenplum Stream Server (GPSS) gpss and gpsscli commands. Pivotal recommends that you migrate to using the GPSS commands.
gpkafka supports two subcommands for backwards compatibility.

The gpkafka utility is a Kafka consumer. It ingests streaming data from a single Kafka topic, using Greenplum Database readable external tables to transform and insert the data into a target Greenplum table. You identify the Kafka source, the data format, the Greenplum connection options, and the target table definition in a YAML-formatted load configuration file that you provide to the utility. If a load operation is interrupted by the user or exits, a subsequent gpkafka load operation that specifies the same Kafka topic and target Greenplum Database table names resumes from the last recorded offset.
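
The resume behavior can be pictured with a small simulation. This is a sketch under simplifying assumptions, not how gpkafka is implemented: the OffsetStore class, the load function, and the stop_after parameter are hypothetical, and the model tracks one offset per topic and table pair rather than one per Kafka partition.

```python
class OffsetStore:
    """Toy record of the last committed offset for each (topic, table) pair."""

    def __init__(self):
        self._offsets = {}

    def last_offset(self, topic, table):
        # -1 means nothing has been loaded yet for this pair.
        return self._offsets.get((topic, table), -1)

    def commit(self, topic, table, offset):
        self._offsets[(topic, table)] = offset


def load(messages, store, topic, table, stop_after=None):
    """Load messages that follow the last recorded offset.

    stop_after simulates a user interrupt after that many messages.
    Returns the messages loaded during this run.
    """
    start = store.last_offset(topic, table) + 1
    loaded = []
    for offset in range(start, len(messages)):
        if stop_after is not None and len(loaded) >= stop_after:
            break  # simulated interrupt
        loaded.append(messages[offset])
        store.commit(topic, table, offset)
    return loaded


if __name__ == "__main__":
    store = OffsetStore()
    topic_messages = ["m0", "m1", "m2", "m3", "m4"]
    first = load(topic_messages, store, "topic_a", "table_a", stop_after=2)
    second = load(topic_messages, store, "topic_a", "table_a")
    print(first)   # ['m0', 'm1']
    print(second)  # ['m2', 'm3', 'm4']
```

Because the recorded offset is keyed by the topic and table names, a run that names a different topic or table starts from the beginning in this model.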

Requirements

The Greenplum-Kafka Integration requires Kafka version 0.11 or newer for exactly-once delivery assurance. You can run with an older version of Kafka (but lose the exactly-once guarantee) by adding the following PROPERTIES block to your gpkafka.yaml load configuration file:
PROPERTIES:
      api.version.request: false
      broker.version.fallback: 0.8.2.1
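
The version rule above can be expressed as a small helper. This is an illustrative sketch only: the function names parse_version and fallback_properties are hypothetical and not part of gpkafka, and the fallback value is copied unchanged from the block above. The helper returns the properties to add for a given Kafka version, or an empty dictionary when the version is 0.11 or newer.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '0.10.2.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def fallback_properties(kafka_version: str) -> dict:
    """Return the PROPERTIES entries needed for a Kafka version older than 0.11.

    An empty dict means no extra properties are required. With the fallback
    properties in place, the exactly-once guarantee is lost.
    """
    if parse_version(kafka_version) >= (0, 11):
        return {}
    return {
        "api.version.request": False,
        "broker.version.fallback": "0.8.2.1",  # value taken from the block above
    }


if __name__ == "__main__":
    for version in ("0.10.2.1", "0.11.0.0", "2.8.0"):
        print(version, fallback_properties(version))
```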

Limitations

The Greenplum-Kafka Integration has the following limitations:

  • The Greenplum-Kafka Integration supports loading from a single Kafka topic to a single Greenplum Database table. You must pre-create the Greenplum table.