Overview of the Greenplum-Kafka Connector

Pivotal Greenplum Database is a massively parallel processing database server specially designed to manage large-scale analytic data warehouses and business intelligence workloads. Apache Kafka is a fault-tolerant, low-latency, distributed publish-subscribe messaging system. The Pivotal Greenplum-Kafka Connector provides high-speed, parallel data transfer from Apache Kafka to Greenplum Database to support a streaming ETL pipeline.

The Pivotal Greenplum-Kafka Connector supports the Apache and Confluent Kafka distributions. Refer to the Apache Kafka Documentation for more information about Apache Kafka.

Kafka stores streams of messages (or records) in categories called topics. A Kafka producer publishes records to partitions in one or more topics. A Kafka consumer subscribes to a topic and receives records in the order that they were sent within a given Kafka partition. Kafka does not guarantee the order of data originating from different Kafka partitions.
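
For example, the following is a minimal sketch of publishing two pipe-delimited records to a topic, assuming Apache Kafka's bundled command line tools are on your PATH; the broker host and topic names are hypothetical:

    $ printf '123|01|99.50\n456|01|12.25\n' | kafka-console-producer.sh \
        --broker-list kbroker1:9092 --topic customer_expenses

A consumer subscribed to customer_expenses receives these two records in publication order only if both records land in the same partition.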

The Greenplum-Kafka Connector includes the gpkafka utility. This utility has two subcommands:

  • gpkafka load - load Kafka data into Greenplum
  • gpkafka check - check the commit history of a load operation

The gpkafka utility is a Kafka consumer. It ingests streaming data from a single Kafka topic, using Greenplum Database readable external tables to transform and insert the data into a target Greenplum table. You identify the Kafka source, the data format, and the Greenplum Database connection options and target table definition in a configuration file that you provide to the utility. If gpkafka is interrupted or exits, a subsequent load operation that specifies the same Kafka topic and target Greenplum Database table resumes from the last recorded offset.
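
For example, a load configuration file might look like the following sketch. The database, host, broker, topic, table, and column names are all hypothetical, and the keys shown are an assumption about the configuration file format; consult the gpkafka reference documentation for the authoritative format:

    DATABASE: ops
    USER: gpadmin
    HOST: mdw
    PORT: 5432
    KAFKA:
       INPUT:
          SOURCE:
             BROKERS: kbroker1:9092
             TOPIC: customer_expenses
          COLUMNS:
             - NAME: cust_id
               TYPE: int
             - NAME: month
               TYPE: int
             - NAME: expenses
               TYPE: decimal(9,2)
          FORMAT: delimited
          DELIMITER: '|'
       OUTPUT:
          TABLE: data_from_kafka
       COMMIT:
          MAX_ROW: 1000

Assuming you save this file as kafka2greenplum.yaml, you might run the load, and later examine its commit history, as follows:

    $ gpkafka load ./kafka2greenplum.yaml
    $ gpkafka check ./kafka2greenplum.yaml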

Prerequisites

The Greenplum-Kafka Connector is installed when you install Greenplum Database. Before using the Connector to load Kafka data to Greenplum Database, ensure that you:

  • Have access to a running Greenplum Database cluster, and that you can identify the hostname of your master node.
  • Can identify the port on which your Greenplum Database master server process is running, if it is not running on the default port (5432).
  • Have access to a running Kafka cluster with ZooKeeper, and that you can identify the hostname(s) and port number(s) of the Kafka broker(s) serving the data.
  • Can identify the Kafka topic of interest.
  • Can run the gpkafka command on a host that has connectivity to both of the following (connectivity checks are sketched after this list):
    • Each Kafka broker host in the Kafka cluster.
    • The Greenplum Database master and all segment hosts.
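
A minimal sketch of verifying this connectivity, using common utilities and hypothetical host, port, and database names, might look like this:

    # Verify that the Greenplum Database master server process is reachable
    $ psql -h mdw -p 5432 -d ops -c 'SELECT version();'

    # Verify that a Kafka broker port is reachable
    $ nc -z kbroker1 9092 && echo 'broker reachable'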

Limitations

The Greenplum-Kafka Connector currently supports:
  • Delimited text format data only. The message content cannot contain the delimiter character or line-ending characters (CR and LF).
  • Loading from a single Kafka topic to a single Greenplum Database table. You must pre-create the target Greenplum table; a sample table definition is sketched after this list.
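
For example, a target table matching the hypothetical configuration file sketched earlier could be pre-created as follows; the table and column definitions are illustrative only:

    CREATE TABLE data_from_kafka (
        cust_id  int,
        month    int,
        expenses decimal(9,2)
    );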

The Greenplum-Kafka Connector currently uses a maximum of 2 CPU cores. This limitation will be addressed in a future release.