gpkafka.yaml

gpkafka configuration file.

Synopsis

DATABASE: db_name
USER: user_name
HOST: host
PORT: greenplum_port
KAFKA:
   INPUT:
      SOURCE:
        BROKERS: kafka_broker_host:broker_port [, ... ]
        TOPIC: kafka_topic
      COLUMNS:
        - NAME: column_name
          TYPE: column_data_type
        [ ... ]
      FORMAT: data_format
      ERROR_LIMIT: { num_errors | percentage_errors }
      [ LOCAL_HOSTNAME: local_hostname ]
      [ LOCAL_PORT: local_port ]
   OUTPUT:
      [ SCHEMA: schema_name ]
      TABLE: table_name
   COMMIT:
      MAX_ROW: num_rows
      MINIMAL_INTERVAL: wait_time

Description

You specify load configuration parameters for the gpkafka utilities in a YAML-formatted configuration file. (This reference page uses the name gpkafka.yaml when referring to this file; you may choose your own name for the file.) Load parameters include Greenplum Database connection and target table information, Kafka broker and topic information, and error and commit thresholds.

The gpkafka utility processes the YAML configuration file in order, using indentation (spaces) to determine the document hierarchy and the relationships between the sections. The use of white space in the file is significant.

Note: The gpkafka.yaml keywords are case-sensitive.

Keywords and Values

Greenplum Database Connection Options
DATABASE: db_name
The name of the Greenplum database.
USER: user_name
The name of the Greenplum Database user/role. This user_name must have permissions as described in Configuring Greenplum Database Role Privileges.
HOST: host
The host name or IP address of the Greenplum Database master host.
PORT: greenplum_port
The port number of the Greenplum Database server on the master host.
KAFKA:INPUT: Options
SOURCE
Kafka input configuration parameters.
BROKERS: kafka_broker_host:broker_port
The host and port identifying the Kafka broker.
TOPIC: kafka_topic
The name of the Kafka topic from which to load data. The topic must exist.
COLUMNS:
The column names and data types.
Note: The gpkafka utility currently uses the structure defined for the target Greenplum Database TABLE when it loads the Kafka data.
NAME: column_name
The name of a column.
TYPE: column_data_type
The data type of the column.
FORMAT: data_format
The format of the Kafka data: csv or text.
Note: gpkafka supports delimited text format data only; it treats the csv and text data_formats equivalently. The message content cannot contain both a delimiter and line ending characters (CR and LF).
ERROR_LIMIT: { num_errors | percentage_errors }
The error threshold, specified as either an absolute number or a percentage. gpkafka load exits when this limit is reached.
LOCAL_HOSTNAME: local_hostname
The name of the local host on which you are running gpkafka. This host should be DNS resolvable from each Greenplum Database segment host. The default value is the output of hostname -f, a short host name.
Note: You must explicitly set LOCAL_HOSTNAME if you need the FQDN of the host.
LOCAL_PORT: local_port
The gpfdist port number on the local host. The default value is 8080.
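As an illustration of the optional INPUT settings described above, the following fragment sets both LOCAL_HOSTNAME and LOCAL_PORT explicitly. It is a sketch only: the host name etl1.example.com and the port 8081 are placeholder values assumed for this example, and the broker, topic, and column values are borrowed from the sample configuration in the Examples section.

```yaml
KAFKA:
   INPUT:
      SOURCE:
         BROKERS: kbrokerhost1:9092
         TOPIC: customer_expenses
      COLUMNS:
         - NAME: cust_id
           TYPE: int
      FORMAT: csv
      ERROR_LIMIT: 25
      # Assumed placeholder: a fully qualified name that each
      # Greenplum Database segment host can resolve.
      LOCAL_HOSTNAME: etl1.example.com
      # Assumed placeholder: overrides the default gpfdist port (8080).
      LOCAL_PORT: 8081
```

Setting LOCAL_HOSTNAME this way is useful when the segment hosts cannot resolve the default host name of the machine running gpkafka.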
KAFKA:OUTPUT: Options
SCHEMA: schema_name
The name of the Greenplum Database schema. The default schema is the public schema.
TABLE: table_name
The name of the Greenplum Database table into which to load the Kafka data.
Greenplum Database COMMIT: Options
COMMIT:
Controls how gpkafka load commits data to Greenplum Database. You must specify at least one of MAX_ROW or MINIMAL_INTERVAL. You may specify both configuration parameters, provided that the two values are not both zero (0).
MAX_ROW: num_rows
The number of rows to batch before triggering an INSERT operation on the Greenplum Database table. The default value of MAX_ROW is 0, which instructs gpkafka to ignore this commit trigger condition.
MINIMAL_INTERVAL: wait_time
The minimum amount of time to wait, in milliseconds, between each INSERT operation on the table. The default value is 0, which means wait forever.
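The COMMIT rules above allow three combinations, sketched here as separate YAML documents (divided by ---) so that they can be compared side by side. The values 1000 and 30000 are illustrative only, taken from the sample configuration in the Examples section. Each fragment shows only the COMMIT block; a complete file also needs the connection, INPUT, and OUTPUT settings.

```yaml
# Alternative 1: row-count trigger only.
# MINIMAL_INTERVAL is omitted and keeps its default of 0.
KAFKA:
   COMMIT:
      MAX_ROW: 1000
---
# Alternative 2: time-based setting only.
# MAX_ROW is omitted and keeps its default of 0, so the
# row-count trigger is ignored.
KAFKA:
   COMMIT:
      MINIMAL_INTERVAL: 30000
---
# Alternative 3: both settings specified, neither of them zero.
KAFKA:
   COMMIT:
      MAX_ROW: 1000
      MINIMAL_INTERVAL: 30000
```

A COMMIT block that omits both parameters, or sets both to 0, does not satisfy the requirement stated above.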

Notes

If you created a database object name using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the gpkafka.yaml configuration file. For example, if you create a table as follows:

CREATE TABLE "MyTable" ("MyColumn" text);

Your gpkafka.yaml configuration file would refer to the above table and column names as:

      COLUMNS:
         - NAME: '"MyColumn"'
           TYPE: text
   OUTPUT:
      TABLE: '"MyTable"'

Examples

Load data from Kafka as defined in the configuration file named kafka2greenplum.yaml:

gpkafka load kafka2greenplum.yaml

Example kafka2greenplum.yaml configuration file:

DATABASE: ops
USER: gpadmin
HOST: mdw-1
PORT: 5432
KAFKA:
   INPUT:
      SOURCE:
         BROKERS: kbrokerhost1:9092
         TOPIC: customer_expenses
      COLUMNS:
         - NAME: cust_id
           TYPE: int
         - NAME: month
           TYPE: int
         - NAME: expenses
           TYPE: decimal(9,2)
      FORMAT: csv
      ERROR_LIMIT: 25
   OUTPUT:
      SCHEMA: payables
      TABLE: expenses
   COMMIT:
      MAX_ROW: 1000
      MINIMAL_INTERVAL: 30000