Overview of the Greenplum-Kafka Integration

Pivotal Greenplum Database is a massively parallel processing database server specially designed to manage large scale analytic data warehouses and business intelligence workloads. Apache Kafka is a fault-tolerant, low-latency, distributed publish-subscribe message system. The Pivotal Greenplum-Kafka Integration provides high speed, parallel data transfer from Apache Kafka to Greenplum Database to support a streaming ETL pipeline.

The Pivotal Greenplum-Kafka Integration supports the Apache and Confluent Kafka distributions. Refer to the Apache Kafka Documentation for more information about Apache Kafka.

A Kafka message may include a key and a value. Kafka stores streams of messages (or records) in categories called topics. A Kafka producer publishes records to partitions in one or more topics. A Kafka consumer subscribes to a topic and receives records in the order that they were sent within a given Kafka partition. Kafka does not guarantee the order of data originating from different Kafka partitions.
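
For example, the console producer shipped with Apache Kafka can publish keyed records to a topic. Records that share a key are hashed to the same partition, so a consumer sees them in publish order; records in different partitions carry no cross-partition ordering guarantee. The broker address and topic name below are placeholders, and recent Kafka releases use --bootstrap-server in place of --broker-list:

    kafka-console-producer.sh --broker-list localhost:9092 \
        --topic customer_expenses \
        --property "parse.key=true" --property "key.separator=:"
    > cust_42:42,1,100.00
    > cust_42:42,2,250.00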

The Greenplum-Kafka Integration includes the gpkafka utility.

The gpkafka utility is a Kafka consumer. It ingests streaming data from a single Kafka topic, using Greenplum Database readable external tables to transform and insert the data into a target Greenplum table. You identify the Kafka source, the data format, the Greenplum connection options, and the target table definition in a configuration file that you provide to the utility. If a load operation is interrupted or exits, running gpkafka again with the same Kafka topic and target Greenplum Database table resumes the load from the last recorded offset.
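
The configuration file is YAML-formatted. The following is a minimal sketch of a configuration and a load invocation; the exact keys vary by gpkafka version, so consult the gpkafka reference for your release. All names here (the ops database, the customer_expenses topic, the data_from_kafka table, and the kbrokerhost1 broker host) are hypothetical:

    $ cat kafka2greenplum.yaml
    DATABASE: ops
    USER: gpadmin
    HOST: mdw
    PORT: 5432
    KAFKA:
      INPUT:
        SOURCE:
          BROKERS: kbrokerhost1:9092
          TOPIC: customer_expenses
        COLUMNS:
          - NAME: cust_id
            TYPE: int
          - NAME: month
            TYPE: int
          - NAME: expenses
            TYPE: decimal(9,2)
        FORMAT: csv
        ERROR_LIMIT: 25
      OUTPUT:
        TABLE: data_from_kafka
      COMMIT:
        MAX_ROW: 1000

    $ gpkafka load kafka2greenplum.yaml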

Prerequisites

The Greenplum-Kafka Integration is installed when you install Greenplum Database. Before using the gpkafka utility to load Kafka data into Greenplum Database, ensure that you:

  • Have access to a running Greenplum Database cluster and can identify the hostname of your master node.
  • Can identify the port on which your Greenplum Database master server process is running, if it is not running on the default port (5432).
  • Have access to a running Kafka cluster with ZooKeeper and can identify the hostname(s) and port number(s) of the Kafka broker(s) serving the data.
  • Can identify the Kafka topic of interest.
  • Can run the command on a host that has connectivity to each of the following (a quick check is sketched after this list):
    • Each Kafka broker host in the Kafka cluster.
    • The Greenplum Database master and all segment hosts.
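
One quick way to verify this connectivity is sketched below. The hostnames (mdw, kbrokerhost1) and port numbers are placeholders for your own cluster:

    # Verify that the Greenplum Database master accepts connections.
    psql -h mdw -p 5432 -d postgres -c "SELECT version();"

    # Verify that each Kafka broker port is reachable.
    nc -z kbrokerhost1 9092 && echo "broker reachable"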

Limitations

The Greenplum-Kafka Integration currently supports only:
  • Loading from a single Kafka topic to a single Greenplum Database table. You must pre-create the Greenplum table before you load data, as in the sketch below.
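
Because gpkafka does not create the target table for you, pre-create it before loading. A minimal sketch matching the hypothetical configuration shown earlier (database ops, table data_from_kafka):

    psql -d ops -c "CREATE TABLE data_from_kafka (
        cust_id  int,
        month    int,
        expenses decimal(9,2)
    ) DISTRIBUTED BY (cust_id);"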