Loading Data with the Streaming Server

You will perform the following tasks when you use the Greenplum Streaming Server (GPSS) to load data into a Greenplum Database table:

  1. Ensure that you meet the prerequisites, and that you have configured and started the Greenplum Streaming Server.
  2. Identify the source and the format of the data, and construct the load configuration file if your client requires one.
  3. Create the target Greenplum Database table.
  4. Assign Greenplum Database role permissions to the table, if required, as described in Configuring Greenplum Database Role Privileges.
  5. Run the GPSS client.
  6. Verify the load operation as described in Checking for Load Errors.

Constructing the Load Configuration File

Note: The Greenplum Streaming Server requires a load configuration file when you use the gpsscli or gpkafka client utilities to load data into Greenplum Database. A load configuration file is not required if you are using the Greenplum-Informatica Connector or a custom GPSS client application.

You configure a load operation from a data source to Greenplum Database via a YAML-formatted configuration file. This configuration file includes parameters that identify the data source and format, information about the Greenplum Database connection and target table, and error and commit thresholds for the operation.

Refer to Constructing the gpkafka.yaml Configuration File in the Greenplum-Kafka Integration Documentation for the YAML file format for a Kafka data source.
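The exact set of properties depends on the data source and the client version. The following is a minimal illustrative sketch in the gpkafka.yaml (version 1) layout; the connection values, broker address, topic, column, and table names are placeholders, and you should verify the property layout against the gpkafka.yaml reference before use:

```yaml
# Illustrative sketch only -- all values are placeholders for your environment.
DATABASE: testdb
USER: gpadmin
HOST: mdw
PORT: 5432
KAFKA:
  INPUT:
    SOURCE:
      BROKERS: kafkahost:9092
      TOPIC: orders
    COLUMNS:
      - NAME: jdata
        TYPE: json
    FORMAT: json
    ERROR_LIMIT: 25
  OUTPUT:
    TABLE: night_sync
  COMMIT:
    MINIMAL_INTERVAL: 2000
```

The ERROR_LIMIT and COMMIT properties correspond to the error and commit thresholds mentioned above.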

Creating the Target Greenplum Table

You must create the target Greenplum table before you initiate a data load operation from a GPSS client. You must be able to identify both the schema name and the table name of the target table.

Note: The column data types that you specify for the target Greenplum Database table are informed by the data formats supported by the GPSS client.
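For example, a target table for the nightly order upload job used later in this topic might be created as follows. The column names and types are hypothetical; they must match the format of the data that your GPSS client delivers:

```sql
-- Hypothetical target table; adjust column names and types to match
-- the incoming data. DISTRIBUTED BY selects the Greenplum distribution key.
CREATE TABLE public.night_sync (
    order_id  int,
    order_ts  timestamp,
    amount    numeric(10,2)
) DISTRIBUTED BY (order_id);
```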

Configuring Greenplum Database Role Privileges

If you load data into Greenplum Database from a GPSS client using a non-admin Greenplum role, the Greenplum administrator must assign the role certain privileges:

  • The role must have USAGE and CREATE privileges on any non-public database schema where:
    • The role writes data to a table in the schema, or
    • gpss creates external tables.

    For example:

    =# GRANT USAGE, CREATE ON SCHEMA schema_name TO role_name;
  • If the role writing to Greenplum Database is not a database or table owner, the role must have SELECT and INSERT privileges on each Greenplum Database table to which the role will write data:
    =# GRANT SELECT, INSERT ON schema_name.table_name TO role_name;
  • The role must have permission to create readable external tables using the Greenplum Database gpfdist protocol:
    =# ALTER ROLE role_name CREATEEXTTABLE(type = 'readable', protocol = 'gpfdist');

Refer to the Greenplum Database Managing Roles and Privileges documentation for further information on assigning privileges to Greenplum Database users.

Warning: Do not directly SELECT from an external table that GPSS creates for your job. Any data that you read in this manner will not be loaded into the Greenplum Database table.

Running the Client

You run a GPSS client to use the Greenplum Streaming Server to load data into Greenplum Database. Installation, configuration, and run procedures for a GPSS client are client-specific. For example, refer to the Pivotal Greenplum-Informatica Connector Documentation for information about this procedure for the Greenplum-Informatica Connector.

You can also use the gpsscli client command line utility to load data into Greenplum Database.

Using the gpsscli Client Utility

The Greenplum Streaming Server (GPSS) includes the gpsscli client command utility. gpsscli provides subcommands to manage Greenplum Streaming Server load jobs and to view job status, progress, and history:
Table 1. gpsscli Subcommands

  Subcommand   Description
  help         Display command help
  history      Show job history
  list         List jobs and their status
  progress     Show job progress
  remove       Remove a job
  start        Start a job
  status       Show job status
  stop         Stop a job
  submit       Submit a job
  wait         Wait for a job to stop

All subcommands include options that allow you to specify the host and/or port number of the GPSS instance that you want to service the request (--gpss-host and --gpss-port). You can also specify the directory to which GPSS writes gpsscli log files (--log-dir).
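For example, to list all jobs registered with a GPSS instance on a specific host and port, directing gpsscli log files to a specific directory (the host name, port number, and log directory are placeholders for your environment):

```
$ gpsscli list --all --gpss-host gpsshost --gpss-port 5019 --log-dir /home/gpadmin/gpsslogs
```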

Note: The Greenplum-Kafka Integration includes a GPSS client command utility named gpkafka. gpkafka is a wrapper around the gpss and gpsscli utilities that provides data load capabilities to Greenplum from a Kafka data source. Refer to the Pivotal Greenplum-Kafka Integration Documentation for information about loading data from Kafka into Greenplum Database.

A typical command workflow when using gpsscli to load data into Greenplum Database follows:

  1. Submit a Greenplum Streaming Server job.
  2. Start the Greenplum Streaming Server job.
  3. (Optional) Check the status, progress, or history of the Greenplum Streaming Server job.
  4. (Optional) Wait for a Greenplum Streaming Server job to complete.
  5. Stop the Greenplum Streaming Server job.
  6. Remove the Greenplum Streaming Server job.
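For the example job used in the sections that follow, this workflow corresponds to the following sequence of gpsscli commands:

```
$ gpsscli submit --name nightly_order_upload loadcfg.yaml
$ gpsscli start nightly_order_upload
$ gpsscli status nightly_order_upload
$ gpsscli wait nightly_order_upload
$ gpsscli stop nightly_order_upload
$ gpsscli remove nightly_order_upload
```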

Alternatively, you can run a single-command load operation that submits a GPSS job on your behalf, starts the job, displays job progress, and may stop the GPSS job. See Running a Single-Command Load.

Submitting a Job

To register a data load operation to Greenplum Database, you submit a job to the Greenplum Streaming Server using the gpsscli submit subcommand. When you submit a job, you provide a YAML-formatted configuration file that defines the parameters of the load operation. Load parameters include Greenplum-specific options, as well as parameters that are specific to the data source. See gpsscli.yaml.

A GPSS job is identified by a name that you provide via the --name option, or by an identifier returned by the gpsscli submit command. You use this name to identify the job throughout its lifecycle.

The following example submits a GPSS job named nightly_order_upload whose load parameters are defined in the configuration file named loadcfg.yaml:

$ gpsscli submit --name nightly_order_upload loadcfg.yaml

A newly-submitted GPSS job is in the Stopped state.

Starting a Job

To start a GPSS job, you run the gpsscli start subcommand. When you start a job, GPSS initiates the data load operation from the client. It sets up the connection to Greenplum Database and creates the external tables that it uses to load data directly into Greenplum segments.

The following example starts the GPSS job named nightly_order_upload:

$ gpsscli start nightly_order_upload

A job that starts successfully enters the Running state.

The default behavior of gpsscli start is to return immediately. When you specify the --quit-at-eof option, the command instead reads data until it receives an EOF, and then stops the job. In this scenario, the job transitions to the Stopped state when the command exits.

Checking Job Status, Progress, History

GPSS provides several subcommands to check the status of running jobs:

  • The gpsscli list subcommand lists running (or all) jobs and their status:
    $ gpsscli list --all
    JobID                   GPHost       GPPort  DataBase    Schema     Table                    Topic         Status  
    monday_summary          sys1          5432   testdb      public     monday_totals            totals        Stopped 
    nightly_order_upload    sys1          5432   testdb      public     night_sync               orders        Running
  • The gpsscli status subcommand displays the status of a specific job:
    $ gpsscli status nightly_order_upload
    ... -[INFO]:-Job nightly_order_upload, status Running, errmsg [], time 2019-10-15T21:56:47.766397Z

    Use the gpsscli status command to determine the success or failure of the operation. If the job status is Error, examine the command output and log file messages for additional information. See Checking for Load Errors.

  • The gpsscli progress subcommand displays the progress of a running job. The command blocks, displaying the commit history as the job runs, and returns when the job stops.
    $ gpsscli progress nightly_order_upload
    StartTime			EndTime				MsgNum     MsgSize  	InsertedRecords	RejectedRecords
    2019-10-15T21:56:49.950134Z	2019-10-15T21:56:49.964751Z	1000       78134        1000      	0         
    2019-10-15T21:56:49.976231Z	2019-10-15T21:56:49.984311Z	1000       77392        1000      	0         
    2019-10-15T21:56:49.993607Z	2019-10-15T21:56:50.003602Z	1000       77194        1000      	0         

    By default, gpsscli progress displays job progress by batch. To display job progress by partition, specify the --partition option to the subcommand:

    $ gpsscli progress nightly_order_upload --partition
    PartitionID	StartTime	                EndTime	                        BeginOffset	EndOffset	MsgSize
    0    		2019-10-15T21:56:54.80469Z	2019-10-15T21:56:54.830441Z	242000      	243000      	81033    
    0    		2019-10-15T21:56:54.846354Z	2019-10-15T21:56:54.880517Z	243000      	244000      	81021     
    0    		2019-10-15T21:56:54.893097Z	2019-10-15T21:56:54.904745Z	244000      	245000      	80504
  • The gpsscli history subcommand displays the commit history for a specific load operation. Note that the input argument to this subcommand is the load configuration file, not the job name.
    The following example displays the complete commit history for the load operation defined by the configuration file named loadcfg.yaml:
    $ gpsscli history --show-commit-history all loadcfg.yaml
    PartitionID	StartTime	                EndTime	                        BeginOffset	EndOffset
    0	        2019-10-15T15:58:25.707977Z	2019-10-15T15:58:26.769737Z	15224	        16224
    0	        2019-10-15T15:58:26.813426Z	2019-10-15T15:58:26.824484Z	16224	        17224
    0	        2019-10-15T15:58:26.832965Z	2019-10-15T15:58:26.843036Z	17224	        18224

Waiting for a Job to Complete

You can use the gpsscli wait subcommand to wait for a running job to complete. A job is complete when there is no more data to read, or when an error is returned. Such jobs transition from the Running state to the Stopped state.

$ gpsscli wait nightly_order_upload

gpsscli wait exits when the job completes.

Stopping a Job

Use the gpsscli stop subcommand to stop a specific job. When you stop a job, GPSS writes any unwritten batched data to the Greenplum Database table and stops actively reading new data from the data source.

$ gpsscli stop nightly_order_upload

A job that you stop enters the Stopped state.

Removing a Job

The gpsscli remove subcommand removes a GPSS job. When you remove a job, GPSS unregisters the job from its job list and releases all job-related resources.

$ gpsscli remove nightly_order_upload

Running a Single-Command Load

The gpsscli load subcommand initiates a data load operation. When you run gpsscli load, GPSS submits, starts, and displays the progress of a job on your behalf.

By default, gpsscli load loads all available data and then waits indefinitely for new messages to load. If you interrupt or exit the command in this mode, the GPSS job remains in the Running state, and you must explicitly stop it with gpsscli stop.

When you provide the --quit-at-eof option to the command, the utility exits after it reads all published data, writes the data to Greenplum Database, and stops the job. The GPSS job is in the Stopped state when the command returns.

Similar to the gpsscli submit command, gpsscli load takes as input a YAML-format configuration file that defines the load parameters:

$ gpsscli load --quit-at-eof loadcfg.yaml

Checking for Load Errors

The Greenplum Streaming Server cannot directly return success or error status to the client. You can obtain success and error information for a GPSS load operation from gpsscli subcommand output, and from messages that GPSS writes to stdout and to the server and client log files.

You can also view data formatting-specific errors encountered during a load operation in the error log.

Examining GPSS Log Files

GPSS writes server and client log messages to files as described in Managing GPSS Log Files.

GPSS writes log messages in the following format:

date:time proc:user:host:proc_pid-[severity]:-message

Example message in a gpss log file:

20181228:15:54:38.621 gpss:gpadmin:myhost:091692-[INFO]:-using config file: /home/gpadmin/gpsscfg.json

Example message in a gpsscli log file:

20190102:16:04:09.444 gpsscli:gpadmin:clihost:035661-[INFO]:-JobID: job_from_kafka1 is started
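Because the severity appears in a fixed -[severity]:- token, log lines are easy to filter and parse with standard shell tools. The following sketch extracts the severity and message fields from the example gpss log line above:

```shell
# Parse a GPSS log line using the documented layout:
#   date:time proc:user:host:proc_pid-[severity]:-message
line='20181228:15:54:38.621 gpss:gpadmin:myhost:091692-[INFO]:-using config file: /home/gpadmin/gpsscfg.json'

# The severity is the bracketed token; the message is everything after "]:-".
severity=$(printf '%s\n' "$line" | sed -E 's/^.*-\[([A-Z]+)\]:-.*$/\1/')
message=$(printf '%s\n' "$line" | sed -E 's/^.*\]:-//')

echo "$severity"   # INFO
echo "$message"    # using config file: /home/gpadmin/gpsscfg.json
```

The same pattern works for scanning a log file for problems, for example with grep -E -- '-\[(ERROR|WARN)\]:-'.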

Determining Batch Load Status

To determine whether GPSS successfully loaded one or more batches of data to Greenplum Database, first examine the status and progress of the job in question. The gpsscli status and gpsscli progress command output identifies any known error conditions.

Also examine gpss command output and logs, searching for messages that identify the number of rows inserted and rejected:

... -[INFO]:- ... Inserted 9 rows
... -[INFO]:- ... Rejected 0 rows

Reading the Error Log

If gpss log output indicates that rows were rejected, you can run the following SQL query to view the data formatting errors that GPSS encountered while inserting data into a Greenplum Database table:

SELECT * FROM gp_read_error_log('gpss_external_table_name');

When you run the query, specify the name of the external table that GPSS used for the load operation; you can identify this name by examining the gpss command output and/or log file messages. For best results, restrict the query to the (short) time interval identified in the gpss output. For example, to display the number of errors:

SELECT count(*) FROM gp_read_error_log('"public"."gpkafkaloadext_ae0eac9f8c94a487f30f749175c3afbf"')
    WHERE cmdtime > '2018-08-10 18:44:23.814651+00'
    AND cmdtime < '2018-08-10 18:44:24.140264+00';
Warning: Do not directly SELECT from an external table that GPSS creates for your job. Any data that you read in this manner will not be loaded into the Greenplum Database table.

Preventing External Table Reuse

GPSS creates a unique external table to load data for a specific job directly into Greenplum Database segments. By default, GPSS reuses this external table each time you restart the job. If the structure of either the source data or the destination Greenplum Database table is altered, GPSS may not be able to reuse the external table it initially created for the job.

You can configure GPSS to create a new external table for all new and restarted jobs submitted to a gpss service instance by setting the ReuseTables configuration property to false in the gpss.json file.
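For example, a gpss.json that disables external table reuse might look like the following. Only the ReuseTables property is taken from this topic; the surrounding structure (the listener and gpfdist sections and the port numbers) is shown for illustration and should be checked against your gpss.json reference:

```json
{
    "ListenAddress": {
        "Host": "",
        "Port": 5019
    },
    "Gpfdist": {
        "Host": "",
        "Port": 8319,
        "ReuseTables": false
    }
}
```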

You can instruct GPSS to create a new external table for a specific job by specifying the --no-reuse option when you invoke the gpsscli start or the gpsscli load subcommand to resume a load operation.