Overview of the Greenplum Stream Server

The Greenplum Stream Server (GPSS) is an ETL (extract, transform, load) tool. An instance of the GPSS server ingests streaming data from one or more clients, using Greenplum Database readable external tables to transform and insert the data into a target Greenplum table. The data source and the format of the data are specific to the client.

The Greenplum Stream Server includes the gpss command-line utility. When you run gpss, you start an instance of GPSS; this instance waits indefinitely for client data.

The Greenplum Stream Server also includes the gpsscli command-line utility, a client tool for submitting data load jobs to a GPSS instance and managing those jobs.

Note: The Greenplum Stream Server gpsscli client utility currently supports only a Kafka data source.

Architecture

The Greenplum Stream Server is a gRPC server. The GPSS gRPC service definition includes the operations and messages necessary to connect to Greenplum Database and examine Greenplum metadata. The service definition also includes the operations and messages necessary to write data from a client into a Greenplum Database table. For more information about gRPC, refer to the gRPC documentation.

The gpsscli utility is a Greenplum Stream Server gRPC client, as are the Greenplum-Kafka Integration and the Greenplum-Informatica Connector. You can develop your own GPSS gRPC client using the GPSS API.

Figure 1. Greenplum Stream Server Architecture

A typical sequence of events for performing an ETL task using the Greenplum Stream Server follows:

  1. A user initiates one or more ETL load jobs via a client application.
  2. The client application uses the gRPC protocol to submit the data load job(s) to a running GPSS service instance and start them.
  3. The GPSS service instance submits each load request transaction to the Greenplum Database cluster master instance, and creates or reuses external tables to store data.
  4. The GPSS service instance writes the data delivered from the client directly into the segments of the Greenplum Database cluster.
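
The sequence above can be illustrated with a toy in-memory model. This is only a sketch for following the flow: it is not the GPSS gRPC API, and the class ToyGpssInstance, its method submit_and_start, and the ext_<job> naming scheme are all hypothetical names invented for illustration. It assumes, for simplicity, one external table per job name.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGpssInstance:
    """In-memory stand-in for a GPSS instance (hypothetical, not the GPSS API)."""

    external_tables: dict = field(default_factory=dict)  # job name -> external table name
    loaded_rows: dict = field(default_factory=dict)      # target table -> rows written
    log: list = field(default_factory=list)              # record of create/reuse decisions

    def submit_and_start(self, job_name, target_table, rows):
        """Model steps 2-4: accept a job, set up an external table, load the rows."""
        # Step 3: create the external table on first use, reuse it afterwards.
        if job_name in self.external_tables:
            self.log.append(f"{job_name}: reuse {self.external_tables[job_name]}")
        else:
            ext_name = f"ext_{job_name}"  # hypothetical naming scheme
            self.external_tables[job_name] = ext_name
            self.log.append(f"{job_name}: create {ext_name}")

        # Step 4: write the client-delivered data into the target table.
        self.loaded_rows.setdefault(target_table, []).extend(rows)
        return len(rows)


if __name__ == "__main__":
    gpss = ToyGpssInstance()
    gpss.submit_and_start("load_orders", "public.orders", [{"id": 1}, {"id": 2}])
    gpss.submit_and_start("load_orders", "public.orders", [{"id": 3}])
    for entry in gpss.log:
        print(entry)
```

In this model the second submission of the same job reuses the external table created by the first, mirroring the "creates or reuses" wording in step 3.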

Limitations

The Greenplum Stream Server does not support loading data from multiple Kafka topics to the same Greenplum Database table. All jobs will hang if GPSS encounters this situation.
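
Because this situation causes jobs to hang rather than fail with an error, it can be worth checking planned jobs before submitting them. The following is a minimal sketch of such a check, assuming each planned job is described by a plain dictionary with "topic" and "table" keys. That representation and the function name find_topic_conflicts are hypothetical; they are not part of GPSS or gpsscli.

```python
from collections import defaultdict


def find_topic_conflicts(jobs):
    """Return {table: sorted topic names} for tables targeted by more than one topic.

    `jobs` is an iterable of dicts with "topic" and "table" keys
    (a hypothetical description of planned load jobs).
    """
    topics_by_table = defaultdict(set)
    for job in jobs:
        topics_by_table[job["table"]].add(job["topic"])
    # Keep only the tables that more than one distinct topic would load into.
    return {
        table: sorted(topics)
        for table, topics in topics_by_table.items()
        if len(topics) > 1
    }


if __name__ == "__main__":
    planned = [
        {"topic": "orders", "table": "public.sales"},
        {"topic": "returns", "table": "public.sales"},
        {"topic": "clicks", "table": "public.events"},
    ]
    for table, topics in find_topic_conflicts(planned).items():
        print(f"{table} is targeted by multiple topics: {', '.join(topics)}")
```

An empty result means no two distinct topics in the planned set load into the same table.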

Refer to the Pivotal Greenplum-Kafka Integration Documentation for additional Kafka-related limitations.