CREATE EXTERNAL TABLE

Defines a new external table.

Synopsis

CREATE [READABLE] EXTERNAL TABLE table_name     
    ( column_name data_type [, ...] | LIKE other_table )
     LOCATION ('file://seghost[:port]/path/file' [, ...])
       | ('gpfdist://filehost[:port]/file_pattern[#transform=trans_name]'
            [, ...])
       | ('gpfdists://filehost[:port]/file_pattern[#transform=trans_name]'
           [, ...])
       | ('gphdfs://hdfs_host[:port]/path/file')
       | ('s3://S3_endpoint[:port]/bucket_name/[S3_prefix]
             [region=S3-region] [config=config_file]')
     FORMAT 'TEXT' 
           [( [HEADER]
              [DELIMITER [AS] 'delimiter' | 'OFF']
              [NULL [AS] 'null string']
              [ESCAPE [AS] 'escape' | 'OFF']
              [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
              [FILL MISSING FIELDS] )]
          | 'CSV'
           [( [HEADER]
              [QUOTE [AS] 'quote'] 
              [DELIMITER [AS] 'delimiter']
              [NULL [AS] 'null string']
              [FORCE NOT NULL column [, ...]]
              [ESCAPE [AS] 'escape']
              [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
              [FILL MISSING FIELDS] )]
          | 'AVRO' 
          | 'PARQUET'
          | 'CUSTOM' (Formatter=<formatter_specifications>)
    [ ENCODING 'encoding' ]
     [ [LOG ERRORS [INTO error_table]] SEGMENT REJECT LIMIT count
      [ROWS | PERCENT] ]

CREATE [READABLE] EXTERNAL WEB TABLE table_name     
   ( column_name data_type [, ...] | LIKE other_table )
      LOCATION ('http://webhost[:port]/path/file' [, ...])
    | EXECUTE 'command' [ON ALL 
                          | MASTER
                          | number_of_segments
                          | HOST ['segment_hostname'] 
                          | SEGMENT segment_id ]
      FORMAT 'TEXT' 
            [( [HEADER]
               [DELIMITER [AS] 'delimiter' | 'OFF']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
           | 'CSV'
            [( [HEADER]
               [QUOTE [AS] 'quote'] 
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [FORCE NOT NULL column [, ...]]
               [ESCAPE [AS] 'escape']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
           | 'CUSTOM' (Formatter=<formatter specifications>)
     [ ENCODING 'encoding' ]
     [ [LOG ERRORS [INTO error_table]] SEGMENT REJECT LIMIT count
       [ROWS | PERCENT] ]

CREATE WRITABLE EXTERNAL TABLE table_name
    ( column_name data_type [, ...] | LIKE other_table )
     LOCATION('gpfdist://outputhost[:port]/filename[#transform=trans_name]'
          [, ...])
      | ('gpfdists://outputhost[:port]/file_pattern[#transform=trans_name]'
          [, ...])
      | ('gphdfs://hdfs_host[:port]/path')
      FORMAT 'TEXT' 
               [( [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF'] )]
          | 'CSV'
               [([QUOTE [AS] 'quote'] 
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
                [FORCE QUOTE column [, ...]]
               [ESCAPE [AS] 'escape'] )]
           | 'AVRO' 
            | 'PARQUET'
            | 'CUSTOM' (Formatter=<formatter specifications>)
    [ ENCODING 'write_encoding' ]
    [ DISTRIBUTED BY (column, [ ... ] ) | DISTRIBUTED RANDOMLY ]

CREATE WRITABLE EXTERNAL TABLE table_name
    ( column_name data_type [, ...] | LIKE other_table )
     LOCATION('s3://S3_endpoint[:port]/bucket_name/[S3_prefix]
            [region=S3-region] [config=config_file]')
      FORMAT 'TEXT' 
               [( [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF'] )]
          | 'CSV'
               [([QUOTE [AS] 'quote'] 
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
                [FORCE QUOTE column [, ...]]
               [ESCAPE [AS] 'escape'] )]

CREATE WRITABLE EXTERNAL WEB TABLE table_name
    ( column_name data_type [, ...] | LIKE other_table )
    EXECUTE 'command' [ON ALL]
    FORMAT 'TEXT' 
               [( [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF'] )]
          | 'CSV'
               [([QUOTE [AS] 'quote'] 
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
                [FORCE QUOTE column [, ...]]
               [ESCAPE [AS] 'escape'] )]
           | 'CUSTOM' (Formatter=<formatter specifications>)
    [ ENCODING 'write_encoding' ]
    [ DISTRIBUTED BY (column, [ ... ] ) | DISTRIBUTED RANDOMLY ]

Description

See "Loading and Unloading Data" in the Greenplum Database Administrator Guide for detailed information about external tables.

CREATE EXTERNAL TABLE or CREATE EXTERNAL WEB TABLE creates a new readable external table definition in Greenplum Database. Readable external tables are typically used for fast, parallel data loading. Once an external table is defined, you can query its data directly (and in parallel) using SQL commands. For example, you can select, join, or sort external table data. You can also create views for external tables. DML operations (UPDATE, INSERT, DELETE, or TRUNCATE) are not allowed on readable external tables, and you cannot create indexes on readable external tables.

CREATE WRITABLE EXTERNAL TABLE or CREATE WRITABLE EXTERNAL WEB TABLE creates a new writable external table definition in Greenplum Database. Writable external tables are typically used for unloading data from the database into a set of files or named pipes. Writable external web tables can also be used to output data to an executable program. Writable external tables can also be used as output targets for Greenplum parallel MapReduce calculations. Once a writable external table is defined, data can be selected from database tables and inserted into the writable external table. Writable external tables only allow INSERT operations – SELECT, UPDATE, DELETE or TRUNCATE are not allowed.

The main difference between regular external tables and web external tables is their data sources. Regular readable external tables access static flat files, whereas web external tables access dynamic data sources – either on a web server or by executing OS commands or scripts.

The FORMAT clause is used to describe how the external table files are formatted. Valid file formats are delimited text (TEXT) and comma separated values (CSV) format, similar to the formatting options available with the PostgreSQL COPY command. If the data in the file does not use the default column delimiter, escape character, null string and so on, you must specify the additional formatting options so that the data in the external file is read correctly by Greenplum Database. For information about using a custom format, see "Loading and Unloading Data" in the Greenplum Database Administrator Guide.

For the gphdfs protocol, you can also specify AVRO or PARQUET in the FORMAT clause to read or write Avro or Parquet format files. For information about Avro and Parquet files, see HDFS File Format Support for the gphdfs Protocol.

Before you can create an external table that writes to or reads from an Amazon Web Services (AWS) S3 bucket, you must configure Greenplum Database to support the protocol. S3 external tables can use either CSV or text-formatted files. Writable S3 external tables support only the INSERT operation. See s3 Protocol Configuration.

Parameters

READABLE | WRITABLE
Specifies the type of external table, readable being the default. Readable external tables are used for loading data into Greenplum Database. Writable external tables are used for unloading data.
WEB
Creates a readable or writable web external table definition in Greenplum Database. There are two forms of readable web external tables – those that access files via the http:// protocol or those that access data by executing OS commands. Writable web external tables output data to an executable program that can accept an input stream of data. Web external tables are not rescannable during query execution.
table_name
The name of the new external table.
column_name
The name of a column to create in the external table definition. Unlike regular tables, external tables do not have column constraints or default values, so do not specify those.
LIKE other_table
The LIKE clause specifies a table from which the new external table automatically copies all column names, data types and Greenplum distribution policy. If the original table specifies any column constraints or default column values, those will not be copied over to the new external table definition.
data_type
The data type of the column.
LOCATION ('protocol://host[:port]/path/file' [, ...])
For readable external tables, specifies the URI of the external data source(s) to be used to populate the external table or web table. Regular readable external tables allow the gpfdist or file protocols. Web external tables allow the http protocol. If port is omitted, port 8080 is assumed for http and gpfdist protocols, and port 9000 for the gphdfs protocol. If using the gpfdist protocol, the path is relative to the directory from which gpfdist is serving files (the directory specified when you started the gpfdist program). Also, gpfdist can use wildcards or other C-style pattern matching (for example, a whitespace character is [[:space:]]) to denote multiple files in a directory. For example:
'gpfdist://filehost:8081/*'
'gpfdist://masterhost/my_load_file'
'file://seghost1/dbfast1/external/myfile.txt'
'http://intranet.example.com/finance/expenses.csv'
For the gphdfs protocol, the URI of the LOCATION clause cannot contain any of these four characters: \, ', <, >. The CREATE EXTERNAL TABLE command returns an error if the URI contains any of these characters.
If you are using MapR clusters with the gphdfs protocol, you specify a specific cluster and the file:
  • To specify the default cluster, the first entry in the MapR configuration file /opt/mapr/conf/mapr-clusters.conf, specify the location of your table with this syntax:
     LOCATION ('gphdfs:///file_path')
    The file_path is the path to the file.
  • To specify another MapR cluster listed in the configuration file, specify the file with this syntax:
     LOCATION ('gphdfs:///mapr/cluster_name/file_path')
    The cluster_name is the name of the cluster specified in the configuration file and file_path is the path to the file.
For writable external tables, specifies the URI location of the gpfdist process or S3 protocol that will collect data output from the Greenplum segments and write it to one or more named files. For gpfdist the path is relative to the directory from which gpfdist is serving files (the directory specified when you started the gpfdist program). If multiple gpfdist locations are listed, the segments sending data will be evenly divided across the available output locations. For example:
'gpfdist://outputhost:8081/data1.out',
'gpfdist://outputhost:8081/data2.out'
With two gpfdist locations listed as in the above example, half of the segments would send their output data to the data1.out file and the other half to the data2.out file.
With the option #transform=trans_name, you can specify a transform to apply when loading or extracting data. The trans_name is the name of the transform in the YAML configuration file that you specify when you run the gpfdist utility. For information about specifying a transform, see gpfdist in the Greenplum Utility Guide.
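For example, a readable external table that applies a transform during loading might be defined as in this sketch; the host, file pattern, transform name, and columns are hypothetical, and the transform must be defined in the YAML configuration file passed to gpfdist:
-- Hypothetical host, file pattern, and transform name (nw_orders)
CREATE READABLE EXTERNAL TABLE ext_orders_xml (order_id int, order_data text)
   LOCATION ('gpfdist://etlhost:8081/*.xml#transform=nw_orders')
   FORMAT 'TEXT' (DELIMITER '|');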
If you specify gphdfs protocol to read or write a file to a Hadoop file system (HDFS), you can read or write an Avro or Parquet format file by specifying the FORMAT clause with either AVRO or PARQUET.
For information about the options when specifying a location for an Avro or Parquet file, see HDFS File Format Support for the gphdfs Protocol.
When you create an external table with the s3 protocol, only the TEXT and CSV formats are supported. The files can be in gzip compressed format. The s3 protocol recognizes the gzip format and uncompresses the files. Only the gzip compression format is supported.
You can specify the s3 protocol to access data on Amazon S3 or data on an Amazon S3 compatible service. Before you can create external tables with the s3 protocol, you must configure Greenplum Database. For information about configuring Greenplum Database, see s3 Protocol Configuration.
For the s3 protocol, the LOCATION clause specifies the S3 endpoint and bucket name where data files are uploaded for the table. For writable external tables you can specify an optional S3 file prefix to use when creating new files for data inserted to the table.
If you specify an S3_prefix for read-only s3 tables, the s3 protocol selects all files that have the specified S3 file prefix.
Note: Although the S3_prefix is an optional part of the syntax, you should always include an S3 prefix for both writable and read-only S3 tables to separate datasets as part of the CREATE EXTERNAL TABLE syntax.
The config parameter in the LOCATION clause specifies the location of the required s3 protocol configuration file that contains Amazon Web Services (AWS) connection credentials and communication parameters. For information about s3 configuration file parameters, see s3 Protocol Configuration File.
This is an example read-only external table definition with the s3 protocol.
CREATE READABLE EXTERNAL TABLE S3TBL (date text, time text, amt int)
   LOCATION('s3://s3-us-west-2.amazonaws.com/s3test.pivotal.io/dataset1/normal/
      config=/home/gpadmin/aws_s3/s3.conf')
   FORMAT 'csv';
The S3 bucket is at the S3 endpoint s3-us-west-2.amazonaws.com and the S3 bucket name is s3test.pivotal.io. The S3 prefix for the files in the bucket is /dataset1/normal/. The configuration file is in /home/gpadmin/aws_s3/s3.conf on all Greenplum Database segments.
This example read-only external table specifies the same location and specifies Amazon S3 region us-west-2 with the region parameter.
CREATE READABLE EXTERNAL TABLE S3TBL (date text, time text, amt int)
   LOCATION('s3://amazonaws.com/s3test.pivotal.io/dataset1/normal/
      region=us-west-2 config=/home/gpadmin/aws_s3/s3.conf')
   FORMAT 'csv';
The port is optional in the URL of the LOCATION clause. If the port is not specified in the URL of the LOCATION clause, the s3 configuration file parameter encryption affects the port used by the s3 protocol (port 80 for HTTP or port 443 for HTTPS). If the port is specified, that port is used regardless of the encryption setting. For example, if port 80 is specified in the LOCATION clause and encryption=true in the s3 configuration file, HTTPS requests are sent to port 80 instead of port 443 and a warning is logged.
To specify an Amazon S3 compatible service in the LOCATION clause, set the s3 configuration file parameter version to 2 and specify the region parameter in the LOCATION clause. (If version is 2, the region parameter is required in the LOCATION clause.) When you define the service in the LOCATION clause, you specify the service endpoint in the URL and specify the service region in the region parameter. This is an example LOCATION clause that contains a region parameter and specifies an Amazon S3 compatible service.
LOCATION ('s3://test.company.com/s3test.company/dataset1/normal/ region=local-test
      config=/home/gpadmin/aws_s3/s3.conf')
When the version parameter is 2, you can also specify an Amazon S3 endpoint. This example specifies an Amazon S3 endpoint.
LOCATION ('s3://s3-us-west-2.amazonaws.com/s3test.pivotal.io/dataset1/normal/ region=us-west-2
      config=/home/gpadmin/aws_s3/s3.conf')
For information about the s3 protocol, see s3:// Protocol in the Greenplum Database Administrator Guide.
EXECUTE 'command' [ON ...]
Allowed for readable web external tables or writable external tables only. For readable web external tables, specifies the OS command to be executed by the segment instances. The command can be a single OS command or a script. The ON clause is used to specify which segment instances will execute the given command.
  • ON ALL is the default. The command will be executed by every active (primary) segment instance on all segment hosts in the Greenplum Database system. If the command executes a script, that script must reside in the same location on all of the segment hosts and be executable by the Greenplum superuser (gpadmin).
  • ON MASTER runs the command on the master host only.
    Note: Logging is not supported for web external tables when the ON MASTER clause is specified.
  • ON number means the command will be executed by the specified number of segments. The particular segments are chosen randomly at runtime by the Greenplum Database system. If the command executes a script, that script must reside in the same location on all of the segment hosts and be executable by the Greenplum superuser (gpadmin).
  • HOST means the command will be executed by one segment on each segment host (once per segment host), regardless of the number of active segment instances per host.
  • HOST segment_hostname means the command will be executed by all active (primary) segment instances on the specified segment host.
  • SEGMENT segment_id means the command will be executed only once by the specified segment. You can determine a segment instance's ID by looking at the content number in the system catalog table gp_segment_configuration. The content ID of the Greenplum Database master is always -1.

For writable external tables, the command specified in the EXECUTE clause must be prepared to have data piped into it. Since all segments that have data to send will write their output to the specified command or program, the only available option for the ON clause is ON ALL.
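For example, this sketch defines a readable web external table that runs a command only on the master host; the table name and command are illustrative:
-- Hypothetical table; runs the OS command only on the master host
CREATE EXTERNAL WEB TABLE master_uptime (output text)
   EXECUTE 'uptime' ON MASTER
   FORMAT 'TEXT' (DELIMITER 'OFF');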

FORMAT 'TEXT | CSV | AVRO | PARQUET' (options)
Specifies the format of the external or web table data - either plain text (TEXT) or comma separated values (CSV) format.
The AVRO and PARQUET formats are supported only with the gphdfs protocol.
For information about the options when specifying the AVRO and PARQUET file format, see HDFS File Format Support for the gphdfs Protocol.
FORMAT 'CUSTOM' (formatter=formatter_specification)
Specifies a custom data format. The formatter_specification specifies the function to use to format the data, followed by comma-separated parameters to the formatter function. The length of the formatter specification, the string including Formatter=, can be up to approximately 50K bytes.
For information about using a custom format, see "Loading and Unloading Data" in the Greenplum Database Administrator Guide.
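As a sketch, a readable external table that uses a fixed-width custom formatter might be defined as follows; the formatter name fixedwidth_in and the column-width parameters are assumptions based on the formatters described in the Administrator Guide, and the host and file are hypothetical:
-- Hypothetical host and file; fixedwidth_in and its parameters are assumed
CREATE READABLE EXTERNAL TABLE students (name varchar(20), address varchar(30), age int)
   LOCATION ('gpfdist://filehost:8081/students.txt')
   FORMAT 'CUSTOM' (formatter=fixedwidth_in, name='20', address='30', age='4');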
DELIMITER
Specifies a single ASCII character that separates columns within each row (line) of data. The default is a tab character in TEXT mode, a comma in CSV mode. In TEXT mode for readable external tables, the delimiter can be set to OFF for special use cases in which unstructured data is loaded into a single-column table.
For the s3 protocol, the delimiter cannot be a newline character (\n) or a carriage return character (\r).
NULL
Specifies the string that represents a NULL value. The default is \N (backslash-N) in TEXT mode, and an empty value with no quotations in CSV mode. You might prefer an empty string even in TEXT mode for cases where you do not want to distinguish NULL values from empty strings. When using external and web tables, any data item that matches this string will be considered a NULL value.
As an example for the text format, this FORMAT clause can be used to specify that the string of two single quotes ('') is a NULL value.
FORMAT 'text' (delimiter ',' null '\'\'\'\'' )
ESCAPE
Specifies the single character that is used for C escape sequences (such as \n,\t,\100, and so on) and for escaping data characters that might otherwise be taken as row or column delimiters. Make sure to choose an escape character that is not used anywhere in your actual column data. The default escape character is a \ (backslash) for text-formatted files and a " (double quote) for csv-formatted files, however it is possible to specify another character to represent an escape. It is also possible to disable escaping in text-formatted files by specifying the value 'OFF' as the escape value. This is very useful for data such as text-formatted web log data that has many embedded backslashes that are not intended to be escapes.
NEWLINE
Specifies the newline used in your data files – LF (Line feed, 0x0A), CR (Carriage return, 0x0D), or CRLF (Carriage return plus line feed, 0x0D 0x0A). If not specified, a Greenplum Database segment will detect the newline type by looking at the first row of data it receives and using the first newline type encountered.
HEADER
For readable external tables, specifies that the first line in the data file(s) is a header row (contains the names of the table columns) and should not be included as data for the table. If using multiple data source files, all files must have a header row.
For the s3 protocol, the column names in the header row cannot contain a newline character (\n) or a carriage return (\r).
QUOTE
Specifies the quotation character for CSV mode. The default is double-quote (").
FORCE NOT NULL
In CSV mode, processes each specified column as though it were quoted and hence not a NULL value. For the default null string in CSV mode (nothing between two delimiters), this causes missing values to be evaluated as zero-length strings.
FORCE QUOTE
In CSV mode for writable external tables, forces quoting to be used for all non-NULL values in each specified column. NULL output is never quoted.
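For example, this sketch defines a writable external table that forces quoting for two text columns; the host, output file, and columns are hypothetical:
-- Hypothetical host, output file, and columns
CREATE WRITABLE EXTERNAL TABLE ext_contacts_out (id int, name text, phone text)
   LOCATION ('gpfdist://etlhost:8081/contacts.out')
   FORMAT 'CSV' (FORCE QUOTE name, phone);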
FILL MISSING FIELDS
In both TEXT and CSV mode for readable external tables, specifying FILL MISSING FIELDS will set missing trailing field values to NULL (instead of reporting an error) when a row of data has missing data fields at the end of a line or row. Blank rows, fields with a NOT NULL constraint, and trailing delimiters on a line will still report an error.
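For example, this sketch defines a readable external table that sets missing trailing fields to NULL instead of rejecting the row; the host, file pattern, and columns are hypothetical:
-- Hypothetical host, file pattern, and columns
CREATE EXTERNAL TABLE ext_ratings (id int, rating float4, note text)
   LOCATION ('gpfdist://filehost:8081/ratings*.txt')
   FORMAT 'TEXT' (DELIMITER '|' FILL MISSING FIELDS);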
ENCODING 'encoding'
Character set encoding to use for the external table. Specify a string constant (such as 'SQL_ASCII'), an integer encoding number, or DEFAULT to use the default client encoding. See Character Set Support.
LOG ERRORS [INTO error_table]
This is an optional clause that can precede a SEGMENT REJECT LIMIT clause to log information about rows with formatting errors. The optional INTO error_table clause specifies an error table where rows with formatting errors will be logged when running in single row error isolation mode.
If the INTO error_table clause is not specified, the error log information is stored internally (not in an error table). Error log information that is stored internally is accessed with the Greenplum Database built-in SQL function gp_read_error_log().
If the error_table specified already exists, it is used. If it does not exist, it is created. If error_table exists and does not have a random distribution (the DISTRIBUTED RANDOMLY clause was not specified when creating the table), an error is returned.
See Notes for information about the error log information and built-in functions for viewing and managing error log information.
The error_table cannot be modified with the ALTER TABLE command.
Note: The optional INTO error_table clause is deprecated and will not be supported in the next major release. Only internal error logs will be supported.
SEGMENT REJECT LIMIT count [ROWS | PERCENT]
Runs a COPY FROM operation in single row error isolation mode. If the input rows have format errors they will be discarded provided that the reject limit count is not reached on any Greenplum segment instance during the load operation. The reject limit count can be specified as number of rows (the default) or percentage of total rows (1-100). If PERCENT is used, each segment starts calculating the bad row percentage only after the number of rows specified by the parameter gp_reject_percent_threshold has been processed. The default for gp_reject_percent_threshold is 300 rows. Constraint errors such as violation of a NOT NULL, CHECK, or UNIQUE constraint will still be handled in "all-or-nothing" input mode. If the limit is not reached, all good rows will be loaded and any error rows discarded.
Note: When reading an external table, Greenplum Database limits the initial number of rows that can contain formatting errors if the SEGMENT REJECT LIMIT is not triggered first or is not specified. If the first 1000 rows are rejected, the COPY operation is stopped and rolled back.

The limit for the number of initial rejected rows can be changed with the Greenplum Database server configuration parameter gp_initial_bad_row_limit. See Server Configuration Parameters for information about the parameter.

DISTRIBUTED BY (column, [ ... ] )
DISTRIBUTED RANDOMLY
Used to declare the Greenplum Database distribution policy for a writable external table. By default, writable external tables are distributed randomly. If the source table you are exporting data from has a hash distribution policy, defining the same distribution key column(s) for the writable external table will improve unload performance by eliminating the need to move rows over the interconnect. When you issue an unload command such as INSERT INTO wex_table SELECT * FROM source_table, the rows that are unloaded can be sent directly from the segments to the output location if the two tables have the same hash distribution policy.
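For example, assuming a hypothetical source table sales that is hash-distributed on txn_id, declaring the same distribution key on the writable external table lets each segment write its rows directly to the output location:
-- Hypothetical source table hash-distributed on txn_id
CREATE TABLE sales (txn_id int, qty int, amt numeric) DISTRIBUTED BY (txn_id);

-- Writable external table with the same distribution key; rows do not need
-- to move over the interconnect during the unload
CREATE WRITABLE EXTERNAL TABLE sales_unload (LIKE sales)
   LOCATION ('gpfdist://etlhost:8081/sales_unload.out')
   FORMAT 'TEXT' (DELIMITER '|')
   DISTRIBUTED BY (txn_id);

INSERT INTO sales_unload SELECT * FROM sales;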

Examples

Start the gpfdist file server program in the background on port 8081 serving files from directory /var/data/staging:

gpfdist -p 8081 -d /var/data/staging -l /home/gpadmin/log &

Create a readable external table named ext_customer using the gpfdist protocol and any text formatted files (*.txt) found in the gpfdist directory. The files are formatted with a pipe (|) as the column delimiter and an empty space as NULL. Also access the external table in single row error isolation mode:

CREATE EXTERNAL TABLE ext_customer
   (id int, name text, sponsor text) 
   LOCATION ( 'gpfdist://filehost:8081/*.txt' ) 
   FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
   LOG ERRORS INTO err_customer SEGMENT REJECT LIMIT 5;

Create the same readable external table definition as above, but with CSV formatted files:

CREATE EXTERNAL TABLE ext_customer 
   (id int, name text, sponsor text) 
   LOCATION ( 'gpfdist://filehost:8081/*.csv' ) 
   FORMAT 'CSV' ( DELIMITER ',' );

Create a readable external table named ext_expenses using the file protocol and several CSV formatted files that have a header row:

CREATE EXTERNAL TABLE ext_expenses (name text, date date, 
amount float4, category text, description text) 
LOCATION ( 
'file://seghost1/dbfast/external/expenses1.csv',
'file://seghost1/dbfast/external/expenses2.csv',
'file://seghost2/dbfast/external/expenses3.csv',
'file://seghost2/dbfast/external/expenses4.csv',
'file://seghost3/dbfast/external/expenses5.csv',
'file://seghost3/dbfast/external/expenses6.csv' 
)
FORMAT 'CSV' ( HEADER );

Create a readable web external table that executes a script once per segment host:

CREATE EXTERNAL WEB TABLE log_output (linenum int, message 
text)  EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST 
 FORMAT 'TEXT' (DELIMITER '|');

Create a writable external table named sales_out that uses gpfdist to write output data to a file named sales.out. The file is formatted with a pipe (|) as the column delimiter and an empty space as NULL.

CREATE WRITABLE EXTERNAL TABLE sales_out (LIKE sales) 
   LOCATION ('gpfdist://etl1:8081/sales.out')
   FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
   DISTRIBUTED BY (txn_id);

Create a writable external web table that pipes output data received by the segments to an executable script named to_adreport_etl.sh:

CREATE WRITABLE EXTERNAL WEB TABLE campaign_out 
(LIKE campaign) 
 EXECUTE '/var/unload_scripts/to_adreport_etl.sh'
 FORMAT 'TEXT' (DELIMITER '|');

Use the writable external table defined above to unload selected data:

INSERT INTO campaign_out SELECT * FROM campaign WHERE 
customer_id=123;

Notes

Greenplum Database can log information about rows with formatting errors in an error table or internally. When you specify LOG ERRORS INTO error_table, Greenplum Database creates the table error_table that contains errors that occur while reading the external table. The table is defined as follows:

CREATE TABLE error_table_name ( cmdtime timestamptz, relname text, 
    filename text, linenum int, bytenum int, errmsg text, 
    rawdata text, rawbytes bytea ) DISTRIBUTED RANDOMLY;

You can view the information in the table with SQL commands.
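For example, assuming the err_customer error table created in the Examples section, this query displays the logged formatting errors:
-- err_customer is the error table from the earlier example
SELECT linenum, errmsg, rawdata FROM err_customer;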

For error log data that is stored internally when the INTO error_table clause is not specified:
  • Use the built-in SQL function gp_read_error_log('table_name'). It requires SELECT privilege on table_name.

    This example displays the error log information for data copied into the table ext_expenses:

    SELECT * from gp_read_error_log('ext_expenses');

    The error log contains the same columns as the error table.

    The function returns FALSE if table_name does not exist.

  • If error log data exists for the specified table, the new error log data is appended to existing error log data. The error log information is not replicated to mirror segments.
  • Use the built-in SQL function gp_truncate_error_log('table_name') to delete the error log data for table_name. It requires the table owner privilege. This example deletes the error log information captured when moving data into the table ext_expenses:
    SELECT gp_truncate_error_log('ext_expenses'); 

    The function returns FALSE if table_name does not exist.

    Specify the * wildcard character to delete error log information for existing tables in the current database. Specify the string *.* to delete all database error log information, including error log information that was not deleted due to previous database issues. If * is specified, database owner privilege is required. If *.* is specified, database super-user privilege is required.

  • When multiple Greenplum Database external tables are defined with the gpfdist, gpfdists, or file protocol and access the same named pipe on a Linux system, Greenplum Database restricts access to the named pipe to a single reader. An error is returned if a second reader attempts to access the named pipe.

HDFS File Format Support for the gphdfs Protocol

If you specify the gphdfs protocol to read or write a file in a Hadoop file system (HDFS), you can read or write an Avro or Parquet format file by specifying the file format with the FORMAT clause.

To read data from or write data to an Avro or Parquet file, you create an external table with the CREATE EXTERNAL TABLE command and specify the location of the Avro file in the LOCATION clause and 'AVRO' in the FORMAT clause. This example is for a readable external table that reads from an Avro file.
CREATE EXTERNAL TABLE tablename (column_spec) LOCATION ( 'gphdfs://location') FORMAT 'AVRO' 
The location can be a file name or a directory containing a set of files. For the file name you can specify the wildcard character * to match any number of characters. If the location specifies multiple files, Greenplum Database uses the schema of the first file read as the schema for the other files.

As part of the location parameter you can specify options for reading or writing the file. After the file name, you can specify parameters with the HTTP query string syntax that starts with ? and uses & between field value pairs.

For this example location parameter, this URL sets compression parameters for an Avro format writable external table.
'gphdfs://myhdfs:8081/avro/singleAvro/array2.avro?compress=true&compression_type=block&codec=snappy' FORMAT 'AVRO'
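A complete writable external table definition using these compression parameters might look like the following sketch; the host, path, and table columns are hypothetical:
-- Hypothetical host, path, and columns
CREATE WRITABLE EXTERNAL TABLE wext_sales_avro (id int, amt float8)
   LOCATION ('gphdfs://myhdfs:8081/avro/sales.avro?compress=true&compression_type=block&codec=snappy')
   FORMAT 'AVRO';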

See "Loading and Unloading Data" in the Greenplum Database Administrator Guide for information about reading and writing the Avro and Parquet format files with external tables.

Avro Files

For readable external tables, the only valid parameter is schema. When reading multiple Avro files, you can specify a file that contains an Avro schema. See "Avro Schema Overrides" in the Greenplum Database Administrator Guide.

For writable external tables, you can specify schema, namespace, and parameters for compression.

Table 1. Avro Format External Table location Parameters

schema
  Value: URL_to_schema_file. Readable and writable. Default: none.
  For a readable external table:
  • The specified schema overrides the schema in the Avro file. See "Avro Schema Overrides".
  • If not specified, Greenplum Database uses the Avro file schema.
  For a writable external table:
  • Uses the specified schema when creating the Avro file.
  • If not specified, Greenplum Database creates a schema according to the external table definition.
namespace
  Value: avro_namespace. Write only. Default: public.avro.
  If specified, a valid Avro namespace.
compress
  Value: true or false. Write only. Default: false.
compression_type
  Value: block. Write only. Optional; for the avro format, compression_type must be block if compress is true.
codec
  Value: deflate or snappy. Write only. Default: deflate.
codec_level (deflate codec only)
  Value: an integer between 1 and 9. Write only. Default: 6.
  The level controls the trade-off between speed and compression. Valid values are 1 to 9, where 1 is the fastest and 9 is the most compressed.

This set of parameters specifies snappy compression:
 'compress=true&codec=snappy'

These two sets of parameters specify deflate compression and are equivalent:

 'compress=true&codec=deflate&codec_level=1'
 'compress=true&codec_level=1'

gphdfs Limitations for Avro Files

For a Greenplum Database writable external table definition, columns cannot specify the NOT NULL clause.

Greenplum Database supports only a single top-level schema in Avro files or specified with the schema parameter in the CREATE EXTERNAL TABLE command. An error is returned if Greenplum Database detects multiple top-level schemas.

Greenplum Database does not support the Avro map data type and returns an error when encountered.

When Greenplum Database reads an array from an Avro file, the array is converted to the literal text value. For example, the array [1,3] is converted to '{1,3}'.

User defined types (UDT), including array UDT, are supported. For a writable external table, the type is converted to string.

Parquet Files

For external tables, you can add parameters after the file specified in the location. This table lists the valid parameters and values.

Table 2. Parquet Format External Table location Parameters

schema
  Values: URL_to_schema. Write only. Default: none.
  If not specified, Greenplum Database creates a schema according to the external table definition.
pagesize
  Values: greater than 1024 bytes. Write only. Default: 1 MB.
rowgroupsize
  Values: greater than 1024 bytes. Write only. Default: 8 MB.
version
  Values: v1 or v2. Write only. Default: v1.
codec
  Values: UNCOMPRESSED, GZIP, LZO, or snappy. Write only. Default: UNCOMPRESSED.
dictionaryenable (see Note)
  Values: true or false. Write only. Default: false.
dictionarypagesize (see Note)
  Values: greater than 1024 bytes. Write only. Default: 512 KB.
Note: The dictionaryenable and dictionarypagesize parameters create an internal dictionary. Enabling a dictionary might improve Parquet file compression if text columns contain similar or duplicate data.
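As a sketch, a writable external table that writes GZIP-compressed Parquet files might be defined as follows; the host, path, and table columns are hypothetical:
-- Hypothetical host, path, and columns
CREATE WRITABLE EXTERNAL TABLE wext_sales_parquet (id int, amt float8)
   LOCATION ('gphdfs://myhdfs:8081/parquet/sales?codec=GZIP')
   FORMAT 'PARQUET';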

s3 Protocol Configuration

The s3 protocol is used with a URI to specify the location of files in an Amazon Simple Storage Service (Amazon S3) bucket. The protocol creates or downloads all files specified by the LOCATION clause. Each Greenplum Database segment instance downloads or uploads one file at a time using several threads. See s3:// Protocol in the Greenplum Database Administrator Guide.

Configuring and Using S3 External Tables

Follow these basic steps to configure the S3 protocol and use S3 external tables, using the available links for more information. See also s3 Protocol Limitations to better understand the capabilities and limitations of S3 external tables:
  1. Configure each database to support the s3 protocol:
    1. In each database that will access an S3 bucket with the s3 protocol, create the read and write functions for the s3 protocol library:
      CREATE OR REPLACE FUNCTION write_to_s3() RETURNS integer AS
         '$libdir/gps3ext.so', 's3_export' LANGUAGE C STABLE;
      CREATE OR REPLACE FUNCTION read_from_s3() RETURNS integer AS
         '$libdir/gps3ext.so', 's3_import' LANGUAGE C STABLE;
    2. In each database that will access an S3 bucket, declare the s3 protocol and specify the read and write functions you created in the previous step:
      CREATE PROTOCOL s3 (writefunc = write_to_s3, readfunc = read_from_s3);
      Note: The protocol name s3 must be the same as the protocol of the URL specified for the external table you create to access an S3 resource.

      The corresponding function is called by every Greenplum Database segment instance. All segment hosts must have access to the S3 bucket.

  2. On each Greenplum Database segment, create and install the s3 protocol configuration file:
    1. Create a template s3 protocol configuration file using the gpcheckcloud utility:
      gpcheckcloud -t > ./mytest_s3.config
    2. Edit the template file to specify the accessid and secret required to connect to the S3 location. See s3 Protocol Configuration File for information about other s3 protocol configuration parameters.
    3. Copy the file to the same location and filename for all Greenplum Database segments on all hosts. The default file location is gpseg_data_dir/gpseg_prefixN/s3/s3.conf. gpseg_data_dir is the path to the Greenplum Database segment data directory, gpseg_prefix is the segment prefix, and N is the segment ID. The segment data directory, prefix, and ID are set when you initialize a Greenplum Database system.

      If you copy the file to a different location or filename, then you must specify the location with the config parameter in the s3 protocol URL. See About the s3 Protocol config Parameter.

    4. Use the gpcheckcloud utility to validate connectivity to the S3 bucket:
      gpcheckcloud -c "s3://<s3-endpoint>/<s3-bucket> config=./mytest_s3.config"
      Specify the correct path to the configuration file for your system, as well as the S3 endpoint name and bucket that you want to check. gpcheckcloud attempts to connect to the S3 endpoint and lists any files in the S3 bucket, if available. A successful connection ends with the message:
      Your configuration works well.
      You can optionally use gpcheckcloud to validate uploading to and downloading from the S3 bucket, as described in Using the gpcheckcloud Utility.
  3. After completing the previous steps to create and configure the s3 protocol, you can specify an s3 protocol URL in the CREATE EXTERNAL TABLE command to define S3 external tables. For read-only S3 tables, the URL defines the location and prefix used to select existing data files that comprise the S3 table. For example:
    CREATE READABLE EXTERNAL TABLE S3TBL (date text, time text, amt int)
       LOCATION('s3://s3-us-west-2.amazonaws.com/s3test.example.com/dataset1/normal/
          config=/home/gpadmin/aws_s3/s3.conf')
       FORMAT 'csv';
    For writable S3 tables, the protocol URL defines the S3 location in which Greenplum database stores data files for the table, as well as a prefix to use when creating files for table INSERT operations. For example:
    CREATE WRITABLE EXTERNAL TABLE S3WRIT (LIKE S3TBL)
       LOCATION('s3://s3-us-west-2.amazonaws.com/s3test.example.com/dataset1/normal/
          config=/home/gpadmin/aws_s3/s3.conf')
       FORMAT 'csv';

    See About the S3 Protocol URL for more information.

s3 Protocol Limitations

These are s3 protocol limitations:
  • Only the S3 path-style URL is supported.
    s3://S3_endpoint/bucketname/[S3_prefix]
  • Only the S3 endpoint is supported. The protocol does not support virtual hosting of S3 buckets (binding a domain name to an S3 bucket).
  • AWS signature version 4 signing process is supported.

    For information about the S3 endpoints supported by each signing process, see http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region.

  • Only a single URL and optional configuration file is supported in the LOCATION clause of the CREATE EXTERNAL TABLE command.
  • If the NEWLINE parameter is not specified in the CREATE EXTERNAL TABLE command, the newline character must be identical in all data files for a specific prefix. If the newline character is different in some data files with the same prefix, read operations on the files might fail.
  • For writable S3 external tables, only the INSERT operation is supported. UPDATE, DELETE, and TRUNCATE operations are not supported.
  • Because Amazon S3 allows a maximum of 10,000 parts for multipart uploads, the maximum chunksize value of 128MB supports a maximum insert size of 1.28TB per Greenplum Database segment for writable s3 tables. You must ensure that the chunksize setting can support the anticipated size of your table. See Multipart Upload Overview in the S3 documentation for more information about uploads to S3.
  • To take advantage of the parallel processing performed by the Greenplum Database segment instances, the files in the S3 location for read-only S3 tables should be similar in size and the number of files should allow for multiple segments to download the data from the S3 location. For example, if the Greenplum Database system consists of 16 segments and there was sufficient network bandwidth, creating 16 files in the S3 location allows each segment to download a file from the S3 location. In contrast, if the location contained only 1 or 2 files, only 1 or 2 segments download data.

About the S3 Protocol URL

For the s3 protocol, you specify a location for files and an optional configuration file location in the LOCATION clause of the CREATE EXTERNAL TABLE command. This is the syntax:

's3://S3_endpoint[:port]/bucket_name/[S3_prefix] [region=S3_region] [config=config_file_location]'

For the s3 protocol, you must specify the S3 endpoint and S3 bucket name. Each Greenplum Database segment instance must have access to the S3 location. The optional S3_prefix value is used to select files for read-only S3 tables, or as a filename prefix to use when uploading files for S3 writable tables.

Note: The Greenplum Database s3 protocol URL must include the S3 endpoint hostname.

To specify an ECS endpoint (an Amazon S3 compatible service) in the LOCATION clause, you must set the s3 configuration file parameter version to 2. The version parameter controls whether the region parameter is used in the LOCATION clause. You can also specify an Amazon S3 location when the version parameter is 2. For information about the version parameter, see s3 Protocol Configuration File.

Note: Although the S3_prefix is an optional part of the syntax, you should always include an S3 prefix for both writable and read-only S3 tables to separate datasets as part of the CREATE EXTERNAL TABLE syntax.

For writable S3 tables, the s3 protocol URL specifies the endpoint and bucket name where Greenplum Database uploads data files for the table. The S3 bucket permissions must be Upload/Delete for the S3 user ID that uploads the files. The S3 file prefix is used for each new file uploaded to the S3 location as a result of inserting data to the table. See About S3 Data Files.

For read-only S3 tables, the S3 file prefix is optional. If you specify an S3_prefix, then the s3 protocol selects all files that start with the specified prefix as data files for the external table. The s3 protocol does not use the slash character (/) as a delimiter, so a slash character following a prefix is treated as part of the prefix itself.

For example, consider the following 5 files that each have the S3_endpoint named s3-us-west-2.amazonaws.com and the bucket_name test1:

s3://s3-us-west-2.amazonaws.com/test1/abc
s3://s3-us-west-2.amazonaws.com/test1/abc/
s3://s3-us-west-2.amazonaws.com/test1/abc/xx
s3://s3-us-west-2.amazonaws.com/test1/abcdef
s3://s3-us-west-2.amazonaws.com/test1/abcdefff
  • If the S3 URL is provided as s3://s3-us-west-2.amazonaws.com/test1/abc, then the abc prefix selects all 5 files.
  • If the S3 URL is provided as s3://s3-us-west-2.amazonaws.com/test1/abc/, then the abc/ prefix selects the files s3://s3-us-west-2.amazonaws.com/test1/abc/ and s3://s3-us-west-2.amazonaws.com/test1/abc/xx.
  • If the S3 URL is provided as s3://s3-us-west-2.amazonaws.com/test1/abcd, then the abcd prefix selects the files s3://s3-us-west-2.amazonaws.com/test1/abcdef and s3://s3-us-west-2.amazonaws.com/test1/abcdefff.

Wildcard characters are not supported in an S3_prefix; however, the S3 prefix functions as if a wildcard character immediately followed the prefix itself.

All of the files selected by the S3 URL (S3_endpoint/bucket_name/S3_prefix) are used as the source for the external table, so they must have the same format. Each file must also contain complete data rows. A data row cannot be split between files. The S3 file permissions must be Open/Download and View for the S3 user ID that is accessing the files.

For information about the Amazon S3 endpoints see http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region. For information about S3 buckets and folders, see the Amazon S3 documentation https://aws.amazon.com/documentation/s3/. For information about the S3 file prefix, see the Amazon S3 documentation Listing Keys Hierarchically Using a Prefix and Delimiter.

The config parameter specifies the location of the required s3 protocol configuration file that contains AWS connection credentials and communication parameters. See About the s3 Protocol config Parameter.

s3 Protocol Configuration File

When using the s3 protocol, an s3 protocol configuration file is required on all Greenplum Database segments. The default location is:
gpseg_data_dir/gpseg-prefixN/s3/s3.conf

The gpseg_data_dir is the path to the Greenplum Database segment data directory, the gpseg-prefix is the segment prefix, and N is the segment ID. The segment data directory, prefix, and ID are set when you initialize a Greenplum Database system.

If you have multiple segment instances on segment hosts, you can simplify the configuration by creating a single location on each segment host. Then you can specify the absolute path to the location with the config parameter in the s3 protocol LOCATION clause. However, note that both read-only and writable S3 external tables use the same parameter values for their connections. If you want to configure protocol parameters differently for read-only and writable S3 tables, then you must use two different s3 protocol configuration files and specify the correct file in the CREATE EXTERNAL TABLE statement when you create each table.

This example specifies a single file location in the s3 directory of the gpadmin home directory:

config=/home/gpadmin/s3/s3.conf

All segment instances on the hosts use the file /home/gpadmin/s3/s3.conf.

The s3 protocol configuration file is a text file that consists of a [default] section and parameters. This is an example configuration file:
[default]
secret = "secret"
accessid = "user access id"
threadnum = 3
chunksize = 67108864

You can use the Greenplum Database gpcheckcloud utility to test the S3 configuration file. See Using the gpcheckcloud Utility.

s3 Configuration File Parameters

accessid
Required. AWS S3 ID to access the S3 bucket.
secret
Required. AWS S3 passcode for the S3 ID to access the S3 bucket.
autocompress
For writable S3 external tables, this parameter specifies whether to compress files (using gzip) before uploading to S3. Files are compressed by default if you do not specify this parameter.
chunksize
The buffer size that each segment thread uses for reading from or writing to the S3 server. The default is 64 MB. The minimum is 8MB and the maximum is 128MB.

When inserting data to a writable S3 table, each Greenplum Database segment writes the data into its buffer (using multiple threads up to the threadnum value) until it is full, after which it writes the buffer to a file in the S3 bucket. This process is then repeated as necessary on each segment until the insert operation completes.

Because Amazon S3 allows a maximum of 10,000 parts for multipart uploads, the minimum chunksize value of 8MB supports a maximum insert size of 80GB per Greenplum Database segment. The maximum chunksize value of 128MB supports a maximum insert size of 1.28TB per segment. For writable S3 tables, you must ensure that the chunksize setting can support the anticipated size of your table. See Multipart Upload Overview in the S3 documentation for more information about uploads to S3.

encryption
Use connections that are secured with Secure Sockets Layer (SSL). Default value is true. The values true, t, on, yes, and y (case insensitive) are treated as true. Any other value is treated as false.

If the port is not specified in the URL in the LOCATION clause of the CREATE EXTERNAL TABLE command, the configuration file encryption parameter affects the port used by the s3 protocol (port 80 for HTTP or port 443 for HTTPS). If the port is specified, that port is used regardless of the encryption setting.

low_speed_limit
The upload/download speed lower limit, in bytes per second. The default speed is 10240 (10K). If the upload or download speed is slower than the limit for longer than the time specified by low_speed_time, then the connection is aborted and retried. After 3 retries, the s3 protocol returns an error. A value of 0 specifies no lower limit.
low_speed_time
When the connection speed is less than low_speed_limit, this parameter specifies the amount of time, in seconds, to wait before aborting an upload to or a download from the S3 bucket. The default is 60 seconds. A value of 0 specifies no time limit.
proxy
Specify a URL that is the proxy that S3 uses to connect to a data source. S3 supports these protocols: HTTP, HTTPS, and SOCKS (4, 4a, 5, 5h). This is the format for the parameter.
proxy = protocol://[user:password@]proxyhost[:port]
If this parameter is not set or is an empty string (proxy = ""), S3 uses the proxy specified by the environment variable http_proxy, https_proxy, or socks_proxy (and the environment variables all_proxy and no_proxy). The environment variable that S3 uses depends on the protocol. For information about the environment variables, see s3 Protocol Proxy Support.
There can be at most one proxy parameter in the configuration file. The URL specified by the parameter is the proxy for all supported protocols.
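For example, a hypothetical configuration file entry that routes s3 protocol requests through an HTTP proxy (the user, password, host, and port are illustrative):
proxy = http://gpadmin:mypassword@proxyhost.example.com:3128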
server_side_encryption
The S3 server-side encryption method that has been configured for the bucket. Greenplum Database supports only server-side encryption with Amazon S3-managed keys, identified by the configuration parameter value sse-s3. Server-side encryption is disabled (none) by default.
threadnum
The maximum number of concurrent threads a segment can create when uploading data to or downloading data from the S3 bucket. The default is 4. The minimum is 1 and the maximum is 8.
verifycert
Controls how the s3 protocol handles authentication when establishing encrypted communication between a client and an S3 data source over HTTPS. The value is either true or false. The default value is true.
  • verifycert=false - Ignores authentication errors and allows encrypted communication over HTTPS.
  • verifycert=true - Requires valid authentication (a proper certificate) for encrypted communication over HTTPS.
Setting the value to false can be useful in testing and development environments to allow communication without changing certificates.
Warning: Setting the value to false exposes a security risk by ignoring invalid credentials when establishing communication between a client and a S3 data store.
version
Specifies the version of the information specified in the LOCATION clause of the CREATE EXTERNAL TABLE command. The value is either 1 or 2. The default value is 1.
If the value is 1, the LOCATION clause supports an Amazon S3 URL, and does not contain the region parameter. If the value is 2, the LOCATION clause supports S3 compatible services and must include the region parameter. The region parameter specifies the S3 data source region. For this S3 URL s3://s3-us-west-2.amazonaws.com/s3test.example.com/dataset1/normal/, the AWS S3 region is us-west-2.
If version is 1 or is not specified, this is an example of the LOCATION clause of the CREATE EXTERNAL TABLE command that specifies an Amazon S3 endpoint.
LOCATION ('s3://s3-us-west-2.amazonaws.com/s3test.example.com/dataset1/normal/
      config=/home/gpadmin/aws_s3/s3.conf')
If version is 2, this is an example LOCATION clause with the region parameter for an AWS S3 compatible service.
LOCATION ('s3://test.company.com/s3test.company/test1/normal/ region=local-test
      config=/home/gpadmin/aws_s3/s3.conf') 
If version is 2, the LOCATION clause can also specify an Amazon S3 endpoint. This example specifies an Amazon S3 endpoint that uses the region parameter.
LOCATION ('s3://s3-us-west-2.amazonaws.com/s3test.example.com/dataset1/normal/ region=us-west-2
      config=/home/gpadmin/aws_s3/s3.conf') 
Note: Greenplum Database can require up to threadnum * chunksize memory on each segment host when uploading or downloading S3 files. Consider this s3 protocol memory requirement when you configure overall Greenplum Database memory, and increase the value of gp_vmem_protect_limit as necessary.

Compatibility

CREATE EXTERNAL TABLE is a Greenplum Database extension. The SQL standard makes no provisions for external tables.