Understanding Custom Formatters

Understanding Custom Formatters

This topic describes custom formatters and how to use them with Greenplum Streaming Server.

A custom formatter is a C function that performs specific formatting or processing on data that is accessed by a Greenplum Database external table. A custom formatter may support options that you provide to direct the processing performed by the function. Greenplum includes built-in and a custom formatters. You can also develop your own custom formatter.

You compile the C custom formatter functions that you develop into a shared library. These functions are available to Greenplum Database users after the shared library is installed in the Greenplum Database cluster and the custom formatter functions are registered as SQL UDFs.

This topic includes the following sections:

Developing a Custom Formatter for GPSS

A custom formatter is a PostgreSQL C language function; refer to C-Language Functions in the PostgreSQL documentation for detailed information about developing C language functions.

Important header files for custom formatter development include: postgres.h, access/formatter.h, and fmgr.h. These headers define the functions and macros required to interact with PostgreSQL and formatter C structures.

For an example Greenplum Database custom formatter implementation, refer to the formatter_fixedwidth example in the Greenplum Database open source github repository.

See Example: Custom Formatter for Kafka for a GPSS- and Kafka-specific custom formatter example.

You can develop and test a custom formatter against a Greenplum Database external table that specifies the file: protocol in the LOCATION URI. Any custom formatter that you develop and test in this fashion should be compatible with GPSS.

The remainder of this topic describes special considerations when developing a custom formatter for use with GPSS.

About Data Boundaries

GPSS handles data boundaries from the source; a custom formatter can expect to receive a complete Kafka message. Certain formatter API calls are expected to behave differently than a typical Greenplum Database formatter:

  • FORMATTER_SET_DATACURSOR() has no effect.
  • FORMATTER_GET_DATALEN() always returns the full size of the message. The message is guaranteed to be complete.
  • FORMATTER_GET_DATACURSOR() always returns 0.
  • GPSS throws an error when the custom formatter returns FMT_NEED_MORE_DATA.

Handling Bad Data

When the custom formatter encounters an unrecoverable error, it should invoke ereport() with the error code ERRCODE_INTERNAL_ERROR (or its siblings) to indicate that the process should be terminated. The bad row data will not be written to the error log in this case.

If the custom formatter encounters an ignorable error and the data loading should continue, it should invoke ereport() with the error code ERRCODE_DATA_EXCEPTION (or its siblings). In this scenario, GPSS writes the bad row data to the error log automatically.

The GPSS extension invokes the FORMATTER_SET_BAD_ROW_DATA() function; the function has no effect when invoked by the custom formatter.

Known Issues

Greenplum Database truncates bad row data written to the error log at the first 0x00 byte. As a result, the gp_read_error_log() and gp_read_persistent_error_log() functions may return an incomplete rawbytes.

Building the Custom Formatter Shared Library with PGXS

You compile the custom formatter function that you write into a shared library that the Greenplum Database server loads on demand.

You can use the PostgreSQL build extension infrastructure (PGXS) to build the source code for your custom formatter function against a Greenplum Database installation. This framework automates common build rules for simple modules. If you have a more complicated use case, you must write your own build system.

To use the PGXS infrastructure to generate a shared library for a custom formatter function that you develop, create a simple Makefile that sets PGXS-specific variables.

Note: Refer to Extension Building Infrastructure in the PostgreSQL documentation for information about the Makefile variables supported by PGXS.
For example, the following Makefile generates a shared library named customfmt_example.so from a C source file named customfmt.c:
MODULE_big = customfmt_example
OBJS = customfmt.o
PG_CPPFLAGS = -I$(shell $(PG_CONFIG) --includedir)
SHLIB_LINK = -L$(shell $(PG_CONFIG) --libdir)

PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

MODULE_big identifes the base name of the shared library generated by the Makefile.

PG_CPPFLAGS adds the Greenplum Database installation include directory to the compiler header file search path.

SHLIB_LINK adds the Greenplum Database installation library directory to the linker search path.

The PG_CONFIG and PGXS variable settings and the include statement are required and typically reside in the last three lines of the Makefile.

Registering the Custom Formatter Function with Greenplum Database

Before you can use a custom formatter, you must register the function with Greenplum Database.

Registering the function involves mapping the function signature to a SQL user-defined function. You define this mapping with the CREATE FUNCTION .. AS command specifying the shared library name. You may choose to use the same name or differing names for the custom formatter and SQL functions.

Sample CREATE FUNCTION ... AS syntax follows:
CREATE FUNCTION sql_function_name(arg[, ...]) RETURNS return_type
  AS 'shared_library_path'[, 'formatter_function_name']

You may omit the shared library .so extension when you specify shared_library_path.

The following command registers a C function named customfmt_import() to a SQL UDF named customfmt_in() when the function is compiled and linked in a shared library named customfmt_example.so:
CREATE FUNCTION customfmt_in() RETURNS record
  AS 'customfmt_example.so', 'customfmt_import'

Using a Custom Formatter in GPSS

You can use a custom formatter in your GPSS load job by identifying the custom data format and providing the formatter function name and parameters in the load configuration file. The names of these properties differ in version 2 and version 3 (Beta) format configuration files. Version 2 format example:

FORMAT: custom
  NAME: formatter_in
  PARAMSTR: aaa="test",bbb="123"

Version 3 (Beta) format example:

    - name: value
      type: text
  name: formatter_in
    - aaa="test"
    - bbb="123" 

When you specify a custom formatter in your GPSS load configuration file, GPSS invokes the formatter to process the data before loading it into Greenplum Database.