gpsscli-v3.yaml (Beta)

GPSS load configuration file (version 3).

Synopsis

target:
  host: host
  port: greenplum_port
  user: user_name
  password: password
  database: db_name
  schema: schema_name
  table: table_name
source:
  DATASOURCE:
    DATASOURCE_specific_properties
channel:
  gpdb_channel:
    mode:
      # specify a single mode property block (described below)
      insert:
        mode_specific_property: value
        ...
      update:
        mode_specific_property: value
        ...
      merge:
        mode_specific_property: value
        ...
    work_schema: work_schema_name
    error_limit: num_errors | percentage_errors
    window:
      batch:
        max_count: number_of_rows
        interval_ms: wait_time
      window_size: num_batches
      window_statement: udf_or_sql_to_run
    mapping:
      target_column_name: source_column_name | expression
      ...
option:
  name: job_name
  save_failing_batch: boolean
  schedule:
    max_retries: num_retries
    retry_interval: retry_time

The mode_specific_property values that you can specify for each mode are as follows:

insert:
  filter_expression: filter_string
update:
  filter_expression: filter_string
  match_columns: [match_column_names]
  order_columns: [order_column_names]
  update_columns: [update_column_names]
  update_condition: update_condition
merge:
  filter_expression: filter_string
  match_columns: [match_column_names]
  update_columns: [update_column_names]
  order_columns: [order_column_names]
  update_condition: update_condition
  delete_condition: delete_condition

Description

Note: Version 3 of the GPSS load configuration file differs in both content and format from previous versions of the file. Certain symbols used in the GPSS version 1 and 2 configuration file reference page syntax have different meanings in version 3 syntax:
  • Brackets [] are literal and are used to specify a list in version 3. They are no longer used to signify the optionality of a property.
  • Curly braces {} are literal and are used to specify YAML mappings in version 3 syntax. They are no longer used with the pipe symbol (|) to identify a list of choices.

You specify the configuration properties for a Greenplum Streaming Server (GPSS) job in a YAML-formatted configuration file that you provide to the gpsscli submit or gpsscli load command. This file contains three types of configuration information: Greenplum Database connection and data import properties, properties specific to the data source from which you load data into Greenplum, and properties specific to the GPSS job.

This reference page uses the name gpsscli-v3.yaml to refer to this file; you may choose your own name for the file.

The gpsscli utility processes the YAML configuration file in order, using indentation (spaces) to determine the document hierarchy and the relationships between the sections. The use of white space in the file is significant. Keywords are not case-sensitive.

You can use the gpsscli convert command to convert a V2 load configuration file to V3 syntax.

Keywords and Values

Greenplum Database target Options
host: host
The host name or IP address of the Greenplum Database master host.
port: greenplum_port
The port number of the Greenplum Database server on the master host.
user: user_name
The name of the Greenplum Database user/role. This user_name must have the permissions described in Configuring Greenplum Database Role Privileges.
password: password
The password for the Greenplum Database user/role.
database: db_name
The name of the Greenplum database.
schema: schema_name
The name of the Greenplum Database schema in which table_name resides. Optional; the default schema is public.
table: table_name
The name of the Greenplum Database table into which GPSS loads the data.
source: Options
source:
The data source.
DATASOURCE
GPSS currently supports the file and kafka data sources.
DATASOURCE_specific_properties
Configuration properties specific to the file or kafka data source; refer to filesource-v3.yaml (Beta) and gpkafka-v3.yaml (Beta) for version 3 configuration file format and properties for these sources.
channel:gpdb_channel Options
mode:
The table load mode: insert, merge, or update. The default mode is insert.
Note: update and merge are not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes (" ") to identify the column.
insert:
Inserts source data into Greenplum.
update:
Updates the target table columns that are listed in update_columns when the input columns identified in match_columns match the named target table columns and the optional update_condition is true.
merge:
Inserts new rows and updates existing rows when:
  • columns are listed in update_columns,
  • the match_columns target table column values are equal to the input data, and
  • an optional update_condition is specified and met.
Deletes rows when:
  • the match_columns target table column values are equal to the input data, and
  • an optional delete_condition is specified and met.
New rows are identified when the match_columns value in the source data does not have a corresponding value in the existing data of the target table. In those cases, the entire row from the source data is inserted, not only the match_columns and update_columns. If multiple new rows in the input data have the same match_columns values, GPSS inserts or updates the target table using a random matching input row. When you specify order_columns, GPSS sorts the input data on the specified column(s) and inserts or updates from the input row with the largest value.
mode_property_name: value
The name-to-value mapping for a mode property. Each mode supports one or more of the following properties, as specified in the Synopsis.
filter_expression: filter_string
The filter to apply to the input data before GPSS loads the data into Greenplum Database. If the filter evaluates to true, GPSS loads the message. If the filter evaluates to false, the message is dropped. filter_string must be a valid SQL conditional expression and may reference one or more source value, key, or meta column names.
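As an illustration, a minimal sketch of an insert mode block that filters the input data follows. The column names amount and region are hypothetical source columns, not names defined by GPSS; the expression is quoted so that YAML treats it as a single string.

```yaml
channel:
  gpdb_channel:
    mode:
      insert:
        # amount and region are hypothetical source column names.
        # Rows for which the expression evaluates to false are dropped.
        filter_expression: "amount > 0 AND region = 'EMEA'"
```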
match_columns: [match_column_names]
A comma-separated list that specifies the column(s) to use as the join condition for the update. The attribute value in the specified target column(s) must be equal to that of the corresponding source data column(s) in order for the row to be updated in the target table.
Required when mode is merge or update.
order_columns: [order_column_names]
A comma-separated list that specifies the column(s) by which GPSS sorts the rows. When multiple matching rows exist in a batch, order_columns is used with match_columns to determine the input row with the largest value; GPSS uses that row to write/update the target.
Optional. May be specified in merge mode to sort the input data rows.
update_columns: [update_column_names]
A comma-separated list that specifies the column(s) to update for the rows that meet the match_columns criteria and the optional update_condition.
Required when mode is merge or update.
update_condition: update_condition
Specifies a boolean condition, similar to that which you would declare in a WHERE clause, that must be met in order for a row in the target table to be updated (or inserted, in the case of a merge). Optional.
delete_condition: delete_condition
In merge mode, specifies a boolean condition, similar to that which you would declare in a WHERE clause, that must be met for GPSS to delete rows in the target table that meet the match_columns criteria. Optional.
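The merge properties described above might be combined as in the following sketch. All column names (customer_id, expenses, updated_at) are hypothetical, and the two condition strings are placeholders only; this page does not specify how column references in a condition are resolved, so verify any condition against your own tables before relying on it.

```yaml
channel:
  gpdb_channel:
    mode:
      merge:
        # Rows are matched on customer_id (hypothetical column).
        match_columns: [customer_id]
        # Columns to update when a match is found.
        update_columns: [expenses, updated_at]
        # With several matching input rows in a batch, the row with
        # the largest updated_at value is the one that is applied.
        order_columns: [updated_at]
        # Placeholder conditions, shown for shape only.
        update_condition: "expenses >= 0"
        delete_condition: "expenses < 0"
```

Both match_columns and update_columns are required in merge and update modes; the remaining properties in this sketch are optional.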
work_schema: work_schema_name
The name of the Greenplum Database schema in which GPSS creates internal tables. The default work_schema_name is public.
error_limit: num_errors | percentage_errors
The error threshold, specified as either an absolute number or a percentage. GPSS stops running the job when this limit is reached.
window:
The batch size and commit window.
batch:
Controls how GPSS commits data to Greenplum Database. You may specify both configuration properties as long as both values are not zero (0). Try setting and tuning interval_ms to your environment; introduce a max_count setting only if you encounter high memory usage associated with message buffering.
max_count: number_of_rows
The number of rows to batch before triggering an INSERT operation on the Greenplum Database table. The default value of max_count is 0, which instructs GPSS to ignore this commit trigger condition.
interval_ms: wait_time
The minimum amount of time to wait (milliseconds) between each INSERT operation on the table. The default value is 5000.
window_size: num_batches
The number of batches to read before executing window_statement. The default value is 0.
window_statement: udf_or_sql_to_run
A user-defined function or SQL command(s) that you want to run after GPSS reads window_size number of batches. The default is null (no command is run).
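A sketch of a window block follows. The numeric values are arbitrary, and refresh_order_summary() is a hypothetical user-defined function that you would need to create yourself.

```yaml
channel:
  gpdb_channel:
    window:
      batch:
        # Commit after 10000 rows are batched; 0 (the default)
        # disables this trigger.
        max_count: 10000
        # Wait at least 2000 milliseconds between commits.
        interval_ms: 2000
      # Run window_statement after every 5 batches.
      window_size: 5
      # Hypothetical UDF; replace with your own function or SQL.
      window_statement: "SELECT refresh_order_summary();"
```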
mapping:
Optional. Overrides the default source-to-target column mapping.
Note: When you specify a mapping, ensure that you provide a mapping for all source data elements of interest. GPSS does not automatically match column names when you provide a mapping block.
target_column_name: source_column_name | expression
target_column_name specifies the target Greenplum Database table column name. GPSS maps this column name to the source column name specified in source_column_name, or to an expression. When you specify an expression, you may provide a value expression that you would specify in the SELECT list of a query, such as a constant value, a column reference, an operator invocation, a built-in or user-defined function call, and so on.
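For example, a mapping block might combine a direct column mapping, an operator invocation, and a constant, as sketched below. Every column name here is hypothetical. Expressions are wrapped in double quotes so that YAML passes them through as a single string, and the SQL string constant keeps its own single quotes inside.

```yaml
channel:
  gpdb_channel:
    mapping:
      # Direct mapping: target column customer_id takes source column cust_id.
      customer_id: cust_id
      # Operator invocation over two hypothetical source columns.
      order_total: "price * quantity"
      # Constant value: a SQL string literal.
      load_source: "'kafka'"
```

Because GPSS does not match column names automatically when a mapping block is present, any target column that should receive data must appear in the block.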
Job option: Options
name: job_name
Identifies the name of the job.
save_failing_batch: boolean
Determines whether or not GPSS saves data into a backup table before it writes the data to Greenplum Database. Saving the data in this manner aids recovery when GPSS encounters errors during the evaluation of expressions. The default is false; GPSS does not use a backup table, and returns immediately when it encounters an expression error. When you set this property to true, GPSS writes both the good and the bad data in the batch to a backup table named gpssbackup_jobhash, and continues to process incoming data. You must then manually load the good data from the backup table into Greenplum.
Note: Using a backup table to hedge against mapping errors may impact performance, especially when the data that you are loading has not been cleaned.
schedule:
Controls the frequency and interval of restarting failed jobs.
retry_interval: retry_time
The period of time that GPSS waits before retrying the job. You can specify the time interval in day (d), hour (h), minute (m), second (s), or millisecond (ms) integer units; do not mix units. The default retry interval is 5m (5 minutes).
max_retries: num_retries
The maximum number of times that GPSS attempts to retry the job. The default is 0 (do not retry). If you specify a negative value, GPSS retries the job indefinitely.
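The job options above could be set as in this sketch. The job name and the retry values are illustrative choices, not recommended settings.

```yaml
option:
  # Illustrative job name.
  name: load_order_history
  # Write failing batches to a backup table and keep processing.
  save_failing_batch: true
  schedule:
    # Retry a failed job up to 3 times.
    max_retries: 3
    # Wait 10 minutes between retries; use a single unit (d, h, m, s, or ms).
    retry_interval: 10m
```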

Notes

If you created a database object name using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the load configuration file. For example, if you create a table as follows:

CREATE TABLE "MyTable" (c1 text);

Your YAML configuration file would refer to the table name as:

target:
   table: '"MyTable"'

Examples

Submit a job to load data into Greenplum Database as defined in the v3 load configuration file named loadit_v3.yaml:

$ gpsscli submit loadit_v3.yaml

Example Greenplum Database configuration properties in loadit_v3.yaml:

target:
  host: gphost
  port: 5432
  user: gpadmin
  password: changeme
  database: testdb
  schema: public
  table: order_history
source:
  kafka:
    kafka_specific_properties
channel:
  gpdb_channel:
    work_schema: public
    error_limit: 25%