CREATE EXTERNAL TABLE

Defines a new external table.

Synopsis

CREATE [READABLE] EXTERNAL TABLE table_name     
    ( column_name data_type [, ...] | LIKE other_table )
      LOCATION ('file://seghost[:port]/path/file' [, ...])
        | ('gpfdist://filehost[:port]/file_pattern[#transform]' [, ...])
        | ('gpfdists://filehost[:port]/file_pattern[#transform]'
            [, ...])
        | ('gphdfs://hdfs_host[:port]/path/file')
        | ('s3://S3_endpoint/bucket_name/[S3_prefix]
            [config=config_file]')
      FORMAT 'TEXT' 
            [( [HEADER]
               [DELIMITER [AS] 'delimiter' | 'OFF']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
           | 'CSV'
            [( [HEADER]
               [QUOTE [AS] 'quote'] 
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [FORCE NOT NULL column [, ...]]
               [ESCAPE [AS] 'escape']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
           | 'AVRO'
           | 'PARQUET'
           | 'CUSTOM' (Formatter=<formatter specifications>)
     [ ENCODING 'encoding' ]
     [ [LOG ERRORS [INTO error_table]] SEGMENT REJECT LIMIT count
       [ROWS | PERCENT] ]

CREATE [READABLE] EXTERNAL WEB TABLE table_name     
   ( column_name data_type [, ...] | LIKE other_table )
      LOCATION ('http://webhost[:port]/path/file' [, ...])
    | EXECUTE 'command' [ON ALL 
                          | MASTER
                          | number_of_segments
                          | HOST ['segment_hostname'] 
                          | SEGMENT segment_id ]
      FORMAT 'TEXT' 
            [( [HEADER]
               [DELIMITER [AS] 'delimiter' | 'OFF']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
           | 'CSV'
            [( [HEADER]
               [QUOTE [AS] 'quote'] 
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [FORCE NOT NULL column [, ...]]
               [ESCAPE [AS] 'escape']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
           | 'CUSTOM' (Formatter=<formatter specifications>)
     [ ENCODING 'encoding' ]
     [ [LOG ERRORS [INTO error_table]] SEGMENT REJECT LIMIT count
       [ROWS | PERCENT] ]

CREATE WRITABLE EXTERNAL TABLE table_name
    ( column_name data_type [, ...] | LIKE other_table )
     LOCATION ('gpfdist://outputhost[:port]/filename[#transform]'
         [, ...])
       | ('gpfdists://outputhost[:port]/file_pattern[#transform]'
           [, ...])
       | ('gphdfs://hdfs_host[:port]/path')
      FORMAT 'TEXT' 
               [( [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF'] )]
          | 'CSV'
               [([QUOTE [AS] 'quote'] 
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
                [FORCE QUOTE column [, ...]]
               [ESCAPE [AS] 'escape'] )]
           | 'AVRO'
           | 'PARQUET'
           | 'CUSTOM' (Formatter=<formatter specifications>)
    [ ENCODING 'write_encoding' ]
    [ DISTRIBUTED BY (column, [ ... ] ) | DISTRIBUTED RANDOMLY ]

CREATE WRITABLE EXTERNAL WEB TABLE table_name
    ( column_name data_type [, ...] | LIKE other_table )
    EXECUTE 'command' [ON ALL]
    FORMAT 'TEXT' 
               [( [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF'] )]
          | 'CSV'
               [([QUOTE [AS] 'quote'] 
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
                [FORCE QUOTE column [, ...]]
               [ESCAPE [AS] 'escape'] )]
           | 'CUSTOM' (Formatter=<formatter specifications>)
    [ ENCODING 'write_encoding' ]
    [ DISTRIBUTED BY (column, [ ... ] ) | DISTRIBUTED RANDOMLY ]

Description

See "Loading and Unloading Data" in the Greenplum Database Administrator Guide for detailed information about external tables.

CREATE EXTERNAL TABLE or CREATE EXTERNAL WEB TABLE creates a new readable external table definition in Greenplum Database. Readable external tables are typically used for fast, parallel data loading. Once an external table is defined, you can query its data directly (and in parallel) using SQL commands. For example, you can select, join, or sort external table data. You can also create views for external tables. DML operations (UPDATE, INSERT, DELETE, or TRUNCATE) are not allowed on readable external tables, and you cannot create indexes on readable external tables.

CREATE WRITABLE EXTERNAL TABLE or CREATE WRITABLE EXTERNAL WEB TABLE creates a new writable external table definition in Greenplum Database. Writable external tables are typically used for unloading data from the database into a set of files or named pipes. Writable external web tables can also be used to output data to an executable program. Writable external tables can also be used as output targets for Greenplum parallel MapReduce calculations. Once a writable external table is defined, data can be selected from database tables and inserted into the writable external table. Writable external tables only allow INSERT operations – SELECT, UPDATE, DELETE or TRUNCATE are not allowed.

The main difference between regular external tables and web external tables is their data sources. Regular readable external tables access static flat files, whereas web external tables access dynamic data sources – either on a web server or by executing OS commands or scripts.

The FORMAT clause is used to describe how the external table files are formatted. Valid file formats are delimited text (TEXT) and comma separated values (CSV) format, similar to the formatting options available with the PostgreSQL COPY command. If the data in the file does not use the default column delimiter, escape character, null string and so on, you must specify the additional formatting options so that the data in the external file is read correctly by Greenplum Database. For information about using a custom format, see "Loading and Unloading Data" in the Greenplum Database Administrator Guide.

For the gphdfs protocol, you can also specify AVRO or PARQUET in the FORMAT clause to read or write Avro or Parquet format files. For information about Avro and Parquet files, see HDFS File Format Support for the gphdfs Protocol.

When accessing files from an Amazon Web Services (AWS) S3 bucket with the s3 protocol, only readable external tables are supported. Before you can create an external table that reads from the S3 bucket, you must configure Greenplum Database to support the protocol. See s3 Protocol Configuration.

Parameters

READABLE | WRITABLE
Specifies the type of external table, readable being the default. Readable external tables are used for loading data into Greenplum Database. Writable external tables are used for unloading data.
WEB
Creates a readable or writable web external table definition in Greenplum Database. There are two forms of readable web external tables – those that access files via the http:// protocol or those that access data by executing OS commands. Writable web external tables output data to an executable program that can accept an input stream of data. Web external tables are not rescannable during query execution.
table_name
The name of the new external table.
column_name
The name of a column to create in the external table definition. Unlike regular tables, external tables do not have column constraints or default values, so do not specify those.
LIKE other_table
The LIKE clause specifies a table from which the new external table automatically copies all column names, data types and Greenplum distribution policy. If the original table specifies any column constraints or default column values, those will not be copied over to the new external table definition.
data_type
The data type of the column.
LOCATION ('protocol://host[:port]/path/file' [, ...])
For readable external tables, specifies the URI of the external data source(s) used to populate the external table or web table. Regular readable external tables allow the file, gpfdist, gpfdists, gphdfs, and s3 protocols. Web external tables allow the http protocol. If port is omitted, port 8080 is assumed for the http and gpfdist protocols, and port 9000 for the gphdfs protocol. If using the gpfdist protocol, the path is relative to the directory from which gpfdist is serving files (the directory specified when you started the gpfdist program). Also, gpfdist can use wildcards or other C-style pattern matching (for example, a whitespace character is [[:space:]]) to denote multiple files in a directory. For example:
'gpfdist://filehost:8081/*'
'gpfdist://masterhost/my_load_file'
'file://seghost1/dbfast1/external/myfile.txt'
'http://intranet.example.com/finance/expenses.csv'
If you are using MapR clusters with the gphdfs protocol, you specify a specific cluster and the file:
  • To specify the default cluster, the first entry in the MapR configuration file /opt/mapr/conf/mapr-clusters.conf, specify the location of your table with this syntax:
     LOCATION ('gphdfs:///file_path')
    The file_path is the path to the file.
  • To specify another MapR cluster listed in the configuration file, specify the file with this syntax:
     LOCATION ('gphdfs:///mapr/cluster_name/file_path')
    The cluster_name is the name of the cluster specified in the configuration file and file_path is the path to the file.
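For example, a readable external table over a file in a named MapR cluster might be declared as follows (the cluster name, file path, and column definitions are hypothetical):
CREATE EXTERNAL TABLE ext_mapr_sales (id int, amount float8)
   LOCATION ('gphdfs:///mapr/my.cluster.com/data/sales.txt')
   FORMAT 'TEXT' (DELIMITER '|');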
For writable external tables, specifies the URI location of the gpfdist process that will collect data output from the Greenplum segments and write it to the named file. The path is relative to the directory from which gpfdist is serving files (the directory specified when you started the gpfdist program). If multiple gpfdist locations are listed, the segments sending data will be evenly divided across the available output locations. For example:
'gpfdist://outputhost:8081/data1.out',
'gpfdist://outputhost:8081/data2.out'
With two gpfdist locations listed as in the above example, half of the segments would send their output data to the data1.out file and the other half to the data2.out file.
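For example, a writable external table that splits its output across those two locations might be defined as follows (the sales source table is hypothetical):
CREATE WRITABLE EXTERNAL TABLE wext_sales (LIKE sales)
   LOCATION ('gpfdist://outputhost:8081/data1.out',
             'gpfdist://outputhost:8081/data2.out')
   FORMAT 'TEXT' (DELIMITER '|');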
If you specify the gphdfs protocol to read or write files in a Hadoop file system (HDFS), you can read or write an Avro or Parquet format file by specifying the FORMAT clause with either AVRO or PARQUET.
For information about the options when specifying a location for an Avro or Parquet file, see HDFS File Format Support for the gphdfs Protocol.
When you create a readable external table with the s3 protocol, only the TEXT and CSV formats are supported. The files can be in gzip compressed format; the s3 protocol recognizes the gzip format and uncompresses the files. gzip is the only supported compression format.
Before you can create external tables with the s3 protocol, you must configure Greenplum Database. For information about configuring Greenplum Database, see s3 Protocol Configuration.
For the s3 protocol, you specify a location for files and an optional configuration file in the LOCATION clause. This is the syntax of the clause.
's3://S3_endpoint/bucket_name/[S3_prefix] [config=config_file_location]'
The AWS S3_endpoint and S3 bucket_name define the location of the files. If needed, the S3_prefix specifies the group of files in the bucket. For information about S3 endpoints, see the Amazon S3 documentation http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region. For information about S3 buckets and folders, see the Amazon S3 documentation http://aws.amazon.com/documentation/s3/.
If you specify an S3_prefix, the s3 protocol selects the files that have the specified S3 file prefix. The s3 protocol does not use the slash character (/) as a delimiter. For example, these files have domain as the S3_endpoint and test1 as the bucket_name.
s3://domain/test1/abc
s3://domain/test1/abc/
s3://domain/test1/abc/xx
s3://domain/test1/abcdef
s3://domain/test1/abcdefff
  • If the file location is s3://domain/test1/abc, the s3 protocol selects all 5 files.
  • If the file location is s3://domain/test1/abc/, the s3 protocol selects the files s3://domain/test1/abc/ and s3://domain/test1/abc/xx.
  • If the file location is s3://domain/test1/abcd, the s3 protocol selects the files s3://domain/test1/abcdef and s3://domain/test1/abcdefff.
Wildcard characters are not supported in an S3_prefix.
For information about the S3 prefix, see the Amazon S3 documentation Listing Keys Hierarchically Using a Prefix and Delimiter.
All the files specified by the S3 file location (S3_endpoint/bucket_name/S3_prefix) are used as the source for the external table and must have the same format. Each file must contain complete data rows; a data row cannot be split between files. The S3 file permissions must be Open/Download and View for the S3 user ID that is accessing the files.
The config parameter specifies the location of the required s3 protocol configuration file that contains AWS connection credentials and communication parameters. See s3 Protocol Configuration File.
This is an example readable external table definition with the s3 protocol.
CREATE READABLE EXTERNAL TABLE S3TBL (date text, time text, amt int)
   location('s3://s3-us-west-2.amazonaws.com/s3test.pivotal.io/dataset1/normal/
      config=/home/gpadmin/aws_s3/s3.conf')
   FORMAT 'csv';
The S3 bucket is at the S3 endpoint s3-us-west-2.amazonaws.com and the S3 bucket name is s3test.pivotal.io. The S3 prefix for the files in the bucket is dataset1/normal/. The configuration file is in /home/gpadmin/aws_s3/s3.conf on all Greenplum Database segments.
EXECUTE 'command' [ON ...]
Allowed for readable web external tables or writable external tables only. For readable web external tables, specifies the OS command to be executed by the segment instances. The command can be a single OS command or a script. The ON clause is used to specify which segment instances will execute the given command.
  • ON ALL is the default. The command will be executed by every active (primary) segment instance on all segment hosts in the Greenplum Database system. If the command executes a script, that script must reside in the same location on all of the segment hosts and be executable by the Greenplum superuser (gpadmin).
  • ON MASTER runs the command on the master host only.
    Note: Logging is not supported for web external tables when the ON MASTER clause is specified.
  • ON number means the command will be executed by the specified number of segments. The particular segments are chosen randomly at runtime by the Greenplum Database system. If the command executes a script, that script must reside in the same location on all of the segment hosts and be executable by the Greenplum superuser (gpadmin).
  • HOST means the command will be executed by one segment on each segment host (once per segment host), regardless of the number of active segment instances per host.
  • HOST segment_hostname means the command will be executed by all active (primary) segment instances on the specified segment host.
  • SEGMENT segment_id means the command will be executed only once by the specified segment. You can determine a segment instance's ID by looking at the content number in the system catalog table gp_segment_configuration. The content ID of the Greenplum Database master is always -1.
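For example, a readable web external table that runs a command only on the master host might be defined like this (the table name and command are hypothetical):
CREATE EXTERNAL WEB TABLE master_uptime (uptime_line text)
   EXECUTE 'uptime' ON MASTER
   FORMAT 'TEXT';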

For writable external tables, the command specified in the EXECUTE clause must be prepared to have data piped into it. Since all segments that have data to send will write their output to the specified command or program, the only available option for the ON clause is ON ALL.

FORMAT 'TEXT | CSV | AVRO | PARQUET' (options)
Specifies the format of the external or web table data - either plain text (TEXT) or comma separated values (CSV) format.
The AVRO and PARQUET formats are supported only with the gphdfs protocol.
For information about the options when specifying the AVRO and PARQUET file format, see HDFS File Format Support for the gphdfs Protocol.
DELIMITER
Specifies a single ASCII character that separates columns within each row (line) of data. The default is a tab character in TEXT mode, a comma in CSV mode. In TEXT mode for readable external tables, the delimiter can be set to OFF for special use cases in which unstructured data is loaded into a single-column table.
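For example, to load unstructured log lines into a single-column table with the delimiter turned off (the table, host, and file names are hypothetical):
CREATE EXTERNAL TABLE ext_rawlog (logline text)
   LOCATION ('gpfdist://filehost:8081/weblogs/*.log')
   FORMAT 'TEXT' (DELIMITER 'OFF');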
NULL
Specifies the string that represents a NULL value. The default is \N (backslash-N) in TEXT mode, and an empty value with no quotations in CSV mode. You might prefer an empty string even in TEXT mode for cases where you do not want to distinguish NULL values from empty strings. When using external and web tables, any data item that matches this string will be considered a NULL value.
As an example for the text format, this FORMAT clause can be used to specify that the string of two single quotes ('') is a NULL value.
FORMAT 'text' (delimiter ',' null '\'\'\'\'' )
ESCAPE
Specifies the single character that is used for C escape sequences (such as \n,\t,\100, and so on) and for escaping data characters that might otherwise be taken as row or column delimiters. Make sure to choose an escape character that is not used anywhere in your actual column data. The default escape character is a \ (backslash) for text-formatted files and a " (double quote) for csv-formatted files, however it is possible to specify another character to represent an escape. It is also possible to disable escaping in text-formatted files by specifying the value 'OFF' as the escape value. This is very useful for data such as text-formatted web log data that has many embedded backslashes that are not intended to be escapes.
NEWLINE
Specifies the newline used in your data files – LF (Line feed, 0x0A), CR (Carriage return, 0x0D), or CRLF (Carriage return plus line feed, 0x0D 0x0A). If not specified, a Greenplum Database segment will detect the newline type by looking at the first row of data it receives and using the first newline type encountered.
HEADER
For readable external tables, specifies that the first line in the data file(s) is a header row (contains the names of the table columns) and should not be included as data for the table. If using multiple data source files, all files must have a header row.
QUOTE
Specifies the quotation character for CSV mode. The default is double-quote (").
FORCE NOT NULL
In CSV mode, processes each specified column as though it were quoted and hence not a NULL value. For the default null string in CSV mode (nothing between two delimiters), this causes missing values to be evaluated as zero-length strings.
FORCE QUOTE
In CSV mode for writable external tables, forces quoting to be used for all non-NULL values in each specified column. NULL output is never quoted.
FILL MISSING FIELDS
In both TEXT and CSV mode for readable external tables, specifying FILL MISSING FIELDS will set missing trailing field values to NULL (instead of reporting an error) when a row of data has missing data fields at the end of a line or row. Blank rows, fields with a NOT NULL constraint, and trailing delimiters on a line will still report an error.
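For example, the following definition (table and file names are hypothetical) sets missing trailing fields to NULL instead of rejecting the row:
CREATE EXTERNAL TABLE ext_sparse (id int, name text, note text)
   LOCATION ('gpfdist://filehost:8081/sparse_data.txt')
   FORMAT 'TEXT' (DELIMITER '|' FILL MISSING FIELDS);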
ENCODING 'encoding'
Character set encoding to use for the external table. Specify a string constant (such as 'SQL_ASCII'), an integer encoding number, or DEFAULT to use the default client encoding. See Character Set Support.
LOG ERRORS [INTO error_table]
This is an optional clause that can precede a SEGMENT REJECT LIMIT clause to log information about rows with formatting errors. The optional INTO error_table clause specifies an error table where rows with formatting errors will be logged when running in single row error isolation mode.
If the INTO error_table clause is not specified, the error log information is stored internally (not in an error table). Error log information that is stored internally is accessed with the Greenplum Database built-in SQL function gp_read_error_log().
If the error_table specified already exists, it is used. If it does not exist, it is created. If error_table exists and does not have a random distribution (the DISTRIBUTED RANDOMLY clause was not specified when creating the table), an error is returned.
See Notes for information about the error log information and built-in functions for viewing and managing error log information.
The error_table cannot be modified with the ALTER TABLE command.
Note: The optional INTO error_table clause is deprecated and will not be supported in the next major release. Only internal error logs will be supported.
SEGMENT REJECT LIMIT count [ROWS | PERCENT]
Runs a COPY FROM operation in single row error isolation mode. If the input rows have format errors they will be discarded provided that the reject limit count is not reached on any Greenplum segment instance during the load operation. The reject limit count can be specified as number of rows (the default) or percentage of total rows (1-100). If PERCENT is used, each segment starts calculating the bad row percentage only after the number of rows specified by the parameter gp_reject_percent_threshold has been processed. The default for gp_reject_percent_threshold is 300 rows. Constraint errors such as violation of a NOT NULL, CHECK, or UNIQUE constraint will still be handled in "all-or-nothing" input mode. If the limit is not reached, all good rows will be loaded and any error rows discarded.
Note: When reading an external table, Greenplum Database limits the initial number of rows that can contain formatting errors if the SEGMENT REJECT LIMIT is not triggered first or is not specified. If the first 1000 rows are rejected, the COPY operation is stopped and rolled back.

The limit for the number of initial rejected rows can be changed with the Greenplum Database server configuration parameter gp_initial_bad_row_limit. See Server Configuration Parameters for information about the parameter.
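For example, the following definition (table and file names are hypothetical) logs format errors internally and tolerates up to 10 percent bad rows on each segment:

CREATE EXTERNAL TABLE ext_orders (order_id int, amount numeric)
   LOCATION ('gpfdist://filehost:8081/orders*.txt')
   FORMAT 'TEXT' (DELIMITER '|')
   LOG ERRORS SEGMENT REJECT LIMIT 10 PERCENT;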

DISTRIBUTED BY (column, [ ... ] )
DISTRIBUTED RANDOMLY
Used to declare the Greenplum Database distribution policy for a writable external table. By default, writable external tables are distributed randomly. If the source table you are exporting data from has a hash distribution policy, defining the same distribution key column(s) for the writable external table will improve unload performance by eliminating the need to move rows over the interconnect. When you issue an unload command such as INSERT INTO wex_table SELECT * FROM source_table, the rows that are unloaded can be sent directly from the segments to the output location if the two tables have the same hash distribution policy.

Examples

Start the gpfdist file server program in the background on port 8081 serving files from directory /var/data/staging:

gpfdist -p 8081 -d /var/data/staging -l /home/gpadmin/log &

Create a readable external table named ext_customer using the gpfdist protocol and any text formatted files (*.txt) found in the gpfdist directory. The files are formatted with a pipe (|) as the column delimiter and an empty space as NULL. Also access the external table in single row error isolation mode:

CREATE EXTERNAL TABLE ext_customer
   (id int, name text, sponsor text) 
   LOCATION ( 'gpfdist://filehost:8081/*.txt' ) 
   FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
   LOG ERRORS INTO err_customer SEGMENT REJECT LIMIT 5;

Create the same readable external table definition as above, but with CSV formatted files:

CREATE EXTERNAL TABLE ext_customer 
   (id int, name text, sponsor text) 
   LOCATION ( 'gpfdist://filehost:8081/*.csv' ) 
   FORMAT 'CSV' ( DELIMITER ',' );

Create a readable external table named ext_expenses using the file protocol and several CSV formatted files that have a header row:

CREATE EXTERNAL TABLE ext_expenses (name text, date date, 
amount float4, category text, description text) 
LOCATION ( 
'file://seghost1/dbfast/external/expenses1.csv',
'file://seghost1/dbfast/external/expenses2.csv',
'file://seghost2/dbfast/external/expenses3.csv',
'file://seghost2/dbfast/external/expenses4.csv',
'file://seghost3/dbfast/external/expenses5.csv',
'file://seghost3/dbfast/external/expenses6.csv' 
)
FORMAT 'CSV' ( HEADER );

Create a readable web external table that executes a script once per segment host:

CREATE EXTERNAL WEB TABLE log_output (linenum int, message 
text)  EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST 
 FORMAT 'TEXT' (DELIMITER '|');

Create a writable external table named sales_out that uses gpfdist to write output data to a file named sales.out. The files are formatted with a pipe (|) as the column delimiter and an empty space as NULL.

CREATE WRITABLE EXTERNAL TABLE sales_out (LIKE sales) 
   LOCATION ('gpfdist://etl1:8081/sales.out')
   FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
   DISTRIBUTED BY (txn_id);

Create a writable external web table that pipes output data received by the segments to an executable script named to_adreport_etl.sh:

CREATE WRITABLE EXTERNAL WEB TABLE campaign_out 
(LIKE campaign) 
 EXECUTE '/var/unload_scripts/to_adreport_etl.sh'
 FORMAT 'TEXT' (DELIMITER '|');

Use the writable external table defined above to unload selected data:

INSERT INTO campaign_out SELECT * FROM campaign WHERE 
customer_id=123;

Notes

Greenplum Database can log information about rows with formatting errors in an error table or internally. When you specify LOG ERRORS INTO error_table, Greenplum Database creates the table error_table that contains errors that occur while reading the external table. The table is defined as follows:

CREATE TABLE error_table_name ( cmdtime timestamptz, relname text, 
    filename text, linenum int, bytenum int, errmsg text, 
    rawdata text, rawbytes bytea ) DISTRIBUTED RANDOMLY;

You can view the information in the table with SQL commands.
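For example, to inspect the errors captured in the err_customer error table created in the Examples section:

SELECT relname, linenum, errmsg, rawdata
FROM err_customer
ORDER BY cmdtime DESC;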

For error log data that is stored internally when the INTO error_table is not specified:
  • Use the built-in SQL function gp_read_error_log('table_name'). It requires SELECT privilege on table_name.

    This example displays the error log information for data copied into the table ext_expenses:

    SELECT * from gp_read_error_log('ext_expenses');

    The error log contains the same columns as the error table.

    The function returns FALSE if table_name does not exist.

  • If error log data exists for the specified table, the new error log data is appended to existing error log data. The error log information is not replicated to mirror segments.
  • Use the built-in SQL function gp_truncate_error_log('table_name') to delete the error log data for table_name. It requires the table owner privilege. This example deletes the error log information captured when moving data into the table ext_expenses:
    SELECT gp_truncate_error_log('ext_expenses'); 

    The function returns FALSE if table_name does not exist.

    Specify the * wildcard character to delete error log information for existing tables in the current database. Specify the string *.* to delete all database error log information, including error log information that was not deleted due to previous database issues. If * is specified, database owner privilege is required. If *.* is specified, database super-user privilege is required.

HDFS File Format Support for the gphdfs Protocol

If you specify the gphdfs protocol to read or write files in a Hadoop file system (HDFS), you can read or write an Avro or Parquet format file by specifying the file format with the FORMAT clause.

To read data from or write data to an Avro or Parquet file, you create an external table with the CREATE EXTERNAL TABLE command and specify the location of the Avro file in the LOCATION clause and 'AVRO' in the FORMAT clause. This example is for a readable external table that reads from an Avro file.
CREATE EXTERNAL TABLE tablename (column_spec) LOCATION ( 'gphdfs://location') FORMAT 'AVRO' 
The location can be a file name or a directory containing a set of files. For the file name, you can specify the wildcard character * to match any number of characters. If the location specifies multiple files, Greenplum Database uses the schema in the first file that is read as the schema for the other files.

As part of the location parameter you can specify options for reading or writing the file. After the file name, you can specify parameters with the HTTP query string syntax, which starts with ? and uses & between field=value pairs.

For example, this location URL sets compression parameters for an Avro format writable external table.
'gphdfs://myhdfs:8081/avro/singleAvro/array2.avro?compress=true&compression_type=block&codec=snappy' FORMAT 'AVRO'
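A complete statement using options like these might look as follows (the host, path, and column definitions are hypothetical):
CREATE WRITABLE EXTERNAL TABLE wext_avro (id int, name text)
   LOCATION ('gphdfs://myhdfs:8081/avro/events.avro?compress=true&compression_type=block&codec=snappy')
   FORMAT 'AVRO';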

See "Loading and Unloading Data" in the Greenplum Database Administrator Guide for information about reading and writing the Avro and Parquet format files with external tables.

Avro Files

For readable external tables, the only valid parameter is schema. When reading multiple Avro files, you can specify a file that contains an Avro schema. See "Avro Schema Overrides" in the Greenplum Database Administrator Guide.

For writable external tables, you can specify schema, namespace, and parameters for compression.

Table 1. Avro Format External Table location Parameters

schema
  Value: URL_to_schema_file
  Readable/Writable: Read and Write
  Default Value: None.
  For a readable external table:
    • The specified schema overrides the schema in the Avro file. See "Avro Schema Overrides".
    • If not specified, Greenplum Database uses the Avro file schema.
  For a writable external table:
    • Uses the specified schema when creating the Avro file.
    • If not specified, Greenplum Database creates a schema according to the external table definition.
namespace
  Value: avro_namespace
  Readable/Writable: Write only
  Default Value: public.avro
  If specified, a valid Avro namespace.
compress
  Value: true or false
  Readable/Writable: Write only
  Default Value: false
compression_type
  Value: block
  Readable/Writable: Write only
  Default Value: Optional.
  For Avro format, compression_type must be block if compress is true.
codec
  Value: deflate or snappy
  Readable/Writable: Write only
  Default Value: deflate
codec_level (deflate codec only)
  Value: integer between 1 and 9
  Readable/Writable: Write only
  Default Value: 6
  The level controls the trade-off between speed and compression. Valid values are 1 to 9, where 1 is the fastest and 9 is the most compressed.

This set of parameters specifies snappy compression:
 'compress=true&codec=snappy'

These two sets of parameters specify deflate compression and are equivalent:

 'compress=true&codec=deflate&codec_level=1'
 'compress=true&codec_level=1'

Parquet Files

For external tables, you can add parameters after the file specified in the location. This table lists the valid parameters and values.

Table 2. Parquet Format External Table location Parameters

schema
  Values: URL_to_schema
  Readable/Writable: Write only
  Default Value: None. If not specified, Greenplum Database creates a schema according to the external table definition.
pagesize
  Values: > 1024 Bytes
  Readable/Writable: Write only
  Default Value: 1 MB
rowgroupsize
  Values: > 1024 Bytes
  Readable/Writable: Write only
  Default Value: 8 MB
version
  Values: v1, v2
  Readable/Writable: Write only
  Default Value: v1
codec
  Values: UNCOMPRESSED, GZIP, LZO, snappy
  Readable/Writable: Write only
  Default Value: UNCOMPRESSED
dictionaryenable (see Note 1)
  Values: true, false
  Readable/Writable: Write only
  Default Value: false
dictionarypagesize (see Note 1)
  Values: > 1024 Bytes
  Readable/Writable: Write only
  Default Value: 512 KB
Note:
  1. Creates an internal dictionary. Enabling a dictionary might improve Parquet file compression if text columns contain similar or duplicate data.
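For example, a writable external table that writes GZIP-compressed Parquet files with larger pages and row groups might be defined as follows (the host, path, and column definitions are hypothetical):
CREATE WRITABLE EXTERNAL TABLE wext_parquet (id int, name text)
   LOCATION ('gphdfs://myhdfs:8081/parquet/events.parquet?codec=GZIP&pagesize=2097152&rowgroupsize=16777216')
   FORMAT 'PARQUET';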

s3 Protocol Configuration

The s3 protocol is used with a URI to specify the location of files in an Amazon Simple Storage Service (Amazon S3) bucket. The protocol downloads all files specified by the LOCATION clause. Each Greenplum Database segment instance downloads one file at a time using several threads. The segment instances download files until all files have been downloaded.

Before you create a readable external table with the s3 protocol, you must configure the Greenplum Database system.
  • Configure the database to support the s3 protocol.
  • Create and install the s3 protocol configuration file on all the Greenplum Database segments.

To configure a database to support the s3 protocol

  1. Create a function to access the s3 protocol library.

    In each Greenplum database that accesses an S3 bucket with the s3 protocol, create a function for the protocol:

    CREATE OR REPLACE FUNCTION read_from_s3() RETURNS integer AS
       '$libdir/gps3ext.so', 's3_import'
    LANGUAGE C STABLE;
  2. Declare the s3 protocol and specify the function that is used to read from an S3 bucket.
    CREATE PROTOCOL s3 (readfunc = read_from_s3);
Note: The protocol name s3 must be the same as the protocol of the URL specified for the readable external table you create to access an S3 resource.

The function is called by every segment. All Greenplum Database segments must have access to the S3 bucket.

To create and install the s3 protocol configuration file

  1. Create a configuration file with the S3 configuration information.
  2. Install the file in the same location for all Greenplum Database segments on all hosts.

    You can specify the location of the file with the config parameter in the LOCATION clause of the EXTERNAL TABLE command. This is the default location:

    gpseg_data_dir/gpseg-prefixN/s3/s3.conf

    The gpseg_data_dir is the path to the Greenplum Database segment data directory, the gpseg-prefix is the segment prefix, and N is the segment ID. The segment data directory, prefix, and ID are set when you initialize a Greenplum Database system.

    If you have multiple segment instances on segment hosts, you can simplify the configuration by creating a single location on each segment host. Then you specify the absolute path to the location with the config parameter in the s3 protocol LOCATION clause.

s3 Protocol Limitations

These are s3 protocol limitations:
  • Only the S3 path-style URL is supported.
    s3://S3_endpoint/bucketname/[S3_prefix]
  • Only the S3 endpoint is supported. The protocol does not support virtual hosting of S3 buckets (binding a domain name to an S3 bucket).
  • AWS signature version 2 and version 4 signing processes are supported.

    For information about the S3 endpoints supported by each signing process, see http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region.

  • S3 encryption is not supported. The S3 file property Server Side Encryption must be None.
Note: To take advantage of the parallel processing performed by the Greenplum Database segment instances, the files in the S3 location should be similar in size and the number of files should allow for multiple segments to download the data from the S3 location. For example, if the Greenplum Database system consists of 16 segments and there was sufficient network bandwidth, creating 16 files in the S3 location allows each segment to download a file from the S3 location. In contrast, if the location contained only 1 or 2 files, only 1 or 2 segments download data.

s3 Protocol Configuration File

When using the s3 protocol, the s3 protocol configuration file is required on all Greenplum Database segments. The default location is
gpseg_data_dir/gpseg-prefixN/s3/s3.conf

The gpseg_data_dir is the path to the Greenplum Database segment data directory, the gpseg-prefix is the segment prefix, and N is the segment ID. The segment data directory, prefix, and ID are set when you initialize a Greenplum Database system.

If you have multiple segment instances on segment hosts, you can simplify the configuration by creating a single location on each segment host. Then you specify the absolute path to the location with the config parameter in the s3 protocol LOCATION clause. This example specifies a location in the gpadmin home directory.

config=/home/gpadmin/s3/s3.conf

All segment instances on the hosts use the file /home/gpadmin/s3/s3.conf.

The s3 protocol configuration file is a text file that consists of a [default] section and parameters. This is an example configuration file.
[default]
secret = "secret"
accessid = "user access id"
threadnum = 3
chunksize = 67108864

You can test your s3 configuration file and the ability for your host to connect to an S3 bucket with the Greenplum Database utility gpcheckcloud. For information about the utility, see "s3:// Protocol" in the Greenplum Database Administrator Guide.

s3 Configuration File Parameters

accessid
Required. AWS S3 ID to access the S3 bucket.
secret
Required. AWS S3 passcode for the S3 ID to access the S3 bucket.
chunksize
The buffer size for each segment thread. The default is 64MB. The minimum is 8MB and the maximum is 128MB.
threadnum
The maximum number of concurrent connections a segment can create when downloading data from the S3 bucket. The default is 4. The minimum is 1 and the maximum is 8.
encryption
Use connections that are secured with Secure Sockets Layer (SSL). Default value is true. The values true, t, on, yes, and y (case insensitive) are treated as true. Any other value is treated as false.
low_speed_limit
The download speed lower limit, in bytes per second. The default speed is 10240 (10K). If the download speed is slower than the limit for longer than the time specified by low_speed_time, the download connection is aborted and retried. After 3 retries, the s3 protocol returns an error. A value of 0 specifies no lower limit.
low_speed_time
When the connection speed is less than low_speed_limit, the amount of time, in minutes, to wait before aborting a download from an S3 bucket. The default is 1 minute. A value of 0 specifies no time limit.
Note: You must ensure that there is sufficient memory on the Greenplum Database segment hosts when the s3 protocol accesses files. Greenplum Database allocates threadnum * chunksize memory on each segment host when accessing S3 files.
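Putting these parameters together, a fuller configuration file might look like the following (all values are illustrative):
[default]
accessid = "user access id"
secret = "secret"
threadnum = 4
chunksize = 67108864
encryption = true
low_speed_limit = 10240
low_speed_time = 1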

Compatibility

CREATE EXTERNAL TABLE is a Greenplum Database extension. The SQL standard makes no provisions for external tables.