Overview of the Greenplum-Kafka Integration
Pivotal Greenplum Database is a massively parallel processing database server designed to manage large-scale analytic data warehouses and business intelligence workloads. Apache Kafka is a fault-tolerant, low-latency, distributed publish-subscribe messaging system. The Greenplum-Kafka Integration uses the Greenplum Streaming Server to provide high-speed, parallel data transfer from Apache Kafka to Greenplum Database in support of a streaming ETL pipeline.
The Greenplum-Kafka Integration supports the Apache and Confluent Kafka distributions. Refer to the Apache Kafka Documentation for more information about Apache Kafka.
A Kafka message may include a key and a value. Kafka stores streams of messages (or records) in categories called topics. A Kafka producer publishes records to partitions in one or more topics. A Kafka consumer subscribes to a topic and receives records in the order that they were sent within a given Kafka partition. Kafka does not guarantee the order of data originating from different Kafka partitions.
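The ordering guarantee described above can be illustrated with a small in-memory model. This is only a sketch: ToyTopic and partition_for are hypothetical names invented for the illustration, no Kafka client library is involved, and the CRC32-based partition choice is a stand-in for Kafka's own partitioner.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Assumption: CRC32 is a stand-in here; it is not Kafka's partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class ToyTopic:
    """Hypothetical in-memory model of a topic split into partitions."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key: str, value: int):
        # Records with the same key always land in the same partition.
        p = partition_for(key, len(self.partitions))
        self.partitions[p].append((key, value))
        # The offset is the record's position within its partition.
        return p, len(self.partitions[p]) - 1

    def read(self, partition: int, start_offset: int = 0):
        # A consumer sees a partition's records in the order they were appended.
        return self.partitions[partition][start_offset:]


topic = ToyTopic(num_partitions=3)
for i in range(6):
    topic.publish(key=f"cust-{i % 2}", value=i)

for p in range(3):
    print(p, topic.read(p))
```

Within each partition the values appear in publish order. Nothing in the model relates the order of records in one partition to the order of records in another, which mirrors the lack of a cross-partition guarantee.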
The Greenplum-Kafka Integration includes the gpkafka utility, which provides the following subcommands:
- gpkafka load - Load Kafka data into Greenplum.
- gpkafka check - Check the commit history of a load operation.
The gpkafka utility is a Kafka consumer. It ingests streaming data from a single Kafka topic, using Greenplum Database readable external tables to transform the data and insert it into, or update it in, a target Greenplum table. You identify the Kafka source, the data format, the Greenplum connection options, and the target table definition in a YAML-formatted load configuration file that you provide to the utility. If a load operation is interrupted by the user or exits, a subsequent gpkafka load operation that specifies the same Kafka topic and target Greenplum Database table names resumes from the last recorded offset.
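The resume behavior can be pictured as a checkpoint keyed by the topic and table names. The following is a minimal sketch of that idea only, not a description of how gpkafka is implemented; load_batch, history, and the topic and table names are all hypothetical.

```python
def load_batch(records, committed_offset, batch_size, target):
    """Append the next batch after committed_offset to target.

    Returns the new last committed offset (hypothetical helper).
    """
    start = committed_offset + 1
    batch = records[start:start + batch_size]
    target.extend(batch)
    return committed_offset + len(batch)


# Assumed stand-ins: a list for one Kafka partition, a list for the target table.
records = [f"msg-{i}" for i in range(10)]
target_table = []

# Last recorded offset per (topic, table) pair; -1 means nothing loaded yet.
history = {}
key = ("orders_topic", "public.orders")

# First run: one batch is loaded, then the operation is interrupted.
history[key] = load_batch(records, history.get(key, -1), 4, target_table)

# Later run with the same topic and table names: resume after the recorded offset.
while history[key] < len(records) - 1:
    history[key] = load_batch(records, history[key], 4, target_table)

print(history[key], len(target_table))
```

Because the second run starts from the recorded offset rather than from the beginning, each record reaches the target exactly once in this model.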
Requirements
PROPERTIES:
  api.version.request: false
  broker.version.fallback: 0.8.2.1
Limitations
The Greenplum-Kafka Integration has the following limitations:
- The Greenplum-Kafka Integration supports loading from a single Kafka topic to a single Greenplum Database table. You must pre-create the Greenplum table.