s3:// Protocol

The s3 protocol is used in a URL that specifies the location of an Amazon Simple Storage Service (Amazon S3) bucket. Amazon S3 provides secure, durable, highly scalable object storage. For information about Amazon S3, see Amazon S3.

Before creating an external table with the s3 protocol, you must configure Greenplum Database. See s3 Protocol Prerequisites.

For the s3 protocol, you specify a location for files and an optional configuration file in the LOCATION clause of the CREATE EXTERNAL TABLE command. The syntax is:

's3://S3_endpoint/bucket_name/[S3_prefix] [config=config_file_location]'

The s3 protocol URL specifies the AWS S3 endpoint, S3 bucket name, and optional S3 file prefix.
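As a rough illustration of the syntax above, the following Python sketch assembles a LOCATION string from its parts. The helper name build_s3_location is hypothetical and is not part of Greenplum Database; it only concatenates strings and does not contact S3.

```python
def build_s3_location(endpoint, bucket, prefix="", config=None):
    """Assemble 's3://S3_endpoint/bucket_name/[S3_prefix] [config=config_file_location]'.

    Hypothetical helper for illustration only.
    """
    if not endpoint or not bucket:
        raise ValueError("endpoint and bucket are both required")
    url = f"s3://{endpoint}/{bucket}/{prefix}"
    if config is not None:
        # The optional config file location follows the URL after one space.
        url += f" config={config}"
    return url


if __name__ == "__main__":
    print(build_s3_location("s3.amazonaws.com", "test", "my_data",
                            config="/home/gpadmin/s3/s3.conf"))
```

The resulting string is what would be placed inside the quotes of the LOCATION clause.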

For information about the AWS S3 endpoints see http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region. For information about S3 buckets and folders, see the Amazon S3 documentation http://aws.amazon.com/documentation/s3/.

If you specify an S3_prefix, the s3 protocol selects the files that have the specified S3 file prefix. The s3 protocol does not use the slash character (/) as a delimiter. For example, these files have domain as the S3_endpoint, and test1 as the bucket_name.

s3://domain/test1/abc
s3://domain/test1/abc/
s3://domain/test1/abc/xx
s3://domain/test1/abcdef
s3://domain/test1/abcdefff
  • If the file location is s3://domain/test1/abc, the s3 protocol selects all 5 files.
  • If the file location is s3://domain/test1/abc/, the s3 protocol selects the files s3://domain/test1/abc/ and s3://domain/test1/abc/xx.
  • If the file location is s3://domain/test1/abcd, the s3 protocol selects the files s3://domain/test1/abcdef and s3://domain/test1/abcdefff.

Wildcard characters are not supported in an S3_prefix.
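The prefix selection described above behaves like a plain string prefix match. The following Python sketch models it over the five example object names in bucket test1. It is an illustration only: select_keys is a hypothetical name, and the real selection is performed by the s3 protocol against Amazon S3.

```python
# Object names inside bucket test1, taken from the example above.
KEYS = [
    "abc",
    "abc/",
    "abc/xx",
    "abcdef",
    "abcdefff",
]


def select_keys(keys, prefix):
    """Return the keys that start with prefix.

    The slash is treated like any other character, so it does not act
    as a directory delimiter.
    """
    return [key for key in keys if key.startswith(prefix)]


if __name__ == "__main__":
    for prefix in ("abc", "abc/", "abcd"):
        print(prefix, "->", select_keys(KEYS, prefix))
```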

For information about the S3 prefix, see the Amazon S3 documentation Listing Keys Hierarchically Using a Prefix and Delimiter.

About S3 Data Files

All the files specified by the S3 file location (S3_endpoint/bucket_name/S3_prefix) are used as the source for the external table. The files must all have the same format, and each file must contain complete data rows; a data row cannot be split between files. Only the TEXT and CSV formats are supported. The files can be gzip compressed: the s3 protocol recognizes the gzip format and uncompresses the files. gzip is the only supported compression format.

The S3 file permissions must be Open/Download and View for the S3 user ID that is accessing the files.

The config parameter specifies the location of the required s3 protocol configuration file that contains AWS connection credentials and communication parameters.

Each Greenplum Database segment instance must have access to the S3 location. Each segment can download one file at a time from the S3 location using several threads. To take advantage of the parallel processing performed by the Greenplum Database segments, the files in the S3 location should be similar in size, and the number of files should allow multiple segments to download data from the S3 location. For example, if the Greenplum Database system consists of 16 segments and there is sufficient network bandwidth, creating 16 files in the S3 location allows each segment to download a file from the S3 location. In contrast, if the location contains only 1 or 2 files, only 1 or 2 segments download data.
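The effect of the file count on parallelism can be sketched with a small Python model. The round-robin assignment used here is an assumption made for illustration; this document does not state how Greenplum Database actually assigns files to segments. The function names and file names are hypothetical.

```python
def assign_files(files, num_segments):
    """Spread files over segments round-robin (assumed scheme, illustration only)."""
    assignment = {segment: [] for segment in range(num_segments)}
    for index, name in enumerate(files):
        assignment[index % num_segments].append(name)
    return assignment


def busy_segments(assignment):
    """Count the segments that received at least one file."""
    return sum(1 for names in assignment.values() if names)


if __name__ == "__main__":
    sixteen = [f"data_{i:02d}.csv" for i in range(16)]
    two = ["data_00.csv", "data_01.csv"]
    print(busy_segments(assign_files(sixteen, 16)))  # every segment has a file
    print(busy_segments(assign_files(two, 16)))      # most segments are idle
```

Under this model, 16 similarly sized files keep all 16 segments busy, while 2 files leave 14 segments without any data to download.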

About the S3 Protocol config Parameter

The optional config parameter specifies the location of the required s3 protocol configuration file. The file contains AWS connection credentials and communication parameters. For information about the file, see s3 Protocol Configuration File.

The configuration file is required on all Greenplum Database segment hosts. The default location is in the data directory of each Greenplum Database segment instance.
gpseg_data_dir/gpseg_prefixN/s3/s3.conf

The gpseg_data_dir is the path to the Greenplum Database segment data directory, the gpseg_prefix is the segment prefix, and N is the segment ID. The segment data directory, prefix, and ID are set when you initialize a Greenplum Database system.
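The way the default path is composed can be sketched in Python. The function name default_s3_conf_path and the sample values (/data/primary, gpseg, 0) are hypothetical; actual values depend on how the Greenplum Database system was initialized.

```python
import posixpath


def default_s3_conf_path(gpseg_data_dir, gpseg_prefix, segment_id):
    """Compose gpseg_data_dir/gpseg_prefixN/s3/s3.conf for one segment instance."""
    segment_dir = f"{gpseg_prefix}{segment_id}"
    return posixpath.join(gpseg_data_dir, segment_dir, "s3", "s3.conf")


if __name__ == "__main__":
    # Hypothetical data directory and prefix, for illustration only.
    for segment_id in range(2):
        print(default_s3_conf_path("/data/primary", "gpseg", segment_id))
```

Because the path differs for every segment instance, a host with several segments needs one copy of the file per segment unless a shared location is given with the config parameter.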

If you have multiple segment instances on segment hosts, you can simplify the configuration by creating a single location on each segment host. Then you specify the absolute path to the location with the config parameter in the s3 protocol LOCATION clause. This example specifies a location in the gpadmin home directory.

LOCATION ('s3://s3.amazonaws.com/test/my_data config=/home/gpadmin/s3/s3.conf')

All segment instances on the hosts use the file /home/gpadmin/s3/s3.conf.

s3 Protocol Prerequisites

Before you create a readable external table with the s3 protocol, you must configure the Greenplum Database system.
  • Configure the database to support the s3 protocol.
  • Create and install the s3 protocol configuration file on all the Greenplum Database segments.

To configure a database to support the s3 protocol

  1. Create a function to access the s3 protocol library.

    In each Greenplum database that accesses an S3 bucket with the s3 protocol, create a function for the protocol:

    CREATE OR REPLACE FUNCTION read_from_s3() RETURNS integer AS
       '$libdir/gps3ext.so', 's3_import'
    LANGUAGE C STABLE;
  2. Declare the s3 protocol and specify the function that is used to read from an S3 bucket.
    CREATE PROTOCOL s3 (readfunc = read_from_s3);
Note: The protocol name s3 must be the same as the protocol of the URL specified for the readable external table you create to access an S3 resource.

The function is called by every Greenplum Database segment instance. All segment hosts must have access to the S3 bucket.

To create and install the s3 protocol configuration file
  1. Create a configuration file with the S3 configuration information.
  2. Install the file in the same location for all Greenplum Database segments on all hosts.

    The default location is gpseg_data_dir/gpseg_prefixN/s3/s3.conf. If you install the file in a different location, you must specify the location with the config parameter in the s3 protocol URL. See About the S3 Protocol config Parameter.

s3 Protocol Configuration File

When using the s3 protocol, the s3 protocol configuration file is required on all Greenplum Database segments. The default location is:
gpseg_data_dir/gpseg_prefixN/s3/s3.conf

The gpseg_data_dir is the path to the Greenplum Database segment data directory, the gpseg_prefix is the segment prefix, and N is the segment ID. The segment data directory, prefix, and ID are set when you initialize a Greenplum Database system.

If you have multiple segment instances on segment hosts, you can simplify the configuration by creating a single location on each segment host. Then you specify the absolute path to the location with the config parameter in the s3 protocol LOCATION clause. This example specifies a location in the gpadmin home directory.

config=/home/gpadmin/s3/s3.conf

All segment instances on the hosts use the file /home/gpadmin/s3/s3.conf.

The s3 protocol configuration file is a text file that consists of a [default] section and parameters. This is an example configuration file.
[default]
secret = "secret"
accessid = "user access id"
threadnum = 3
chunksize = 67108864

You can use the Greenplum Database gpcheckcloud utility to test the S3 configuration file. See About the gpcheckcloud Utility.
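Before testing with gpcheckcloud, it can be handy to inspect a configuration file locally. The following Python sketch reads the INI-style layout with the standard configparser module. It is not the parser used by the s3 protocol, and whether both parsers accept exactly the same input is an assumption. The sample text uses the threadnum parameter described under s3 Configuration File Parameters, and load_s3_conf is a hypothetical helper name.

```python
import configparser

SAMPLE = """\
[default]
secret = "secret"
accessid = "user access id"
threadnum = 3
chunksize = 67108864
"""


def load_s3_conf(text):
    """Return the [default] section as a dict, with surrounding quotes removed."""
    # interpolation=None keeps any '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["default"]
    return {key: value.strip('"') for key, value in section.items()}


if __name__ == "__main__":
    conf = load_s3_conf(SAMPLE)
    # Avoid printing the secret; show only the non-sensitive settings.
    print(conf["accessid"], conf["threadnum"], conf["chunksize"])
```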

s3 Configuration File Parameters

accessid
Required. AWS S3 ID to access the S3 bucket.
secret
Required. AWS S3 passcode for the S3 ID to access the S3 bucket.
chunksize
The buffer size for each segment thread. The default is 64 MB. The minimum is 8 MB and the maximum is 128 MB.
threadnum
The maximum number of concurrent connections a segment can create when downloading data from the S3 bucket. The default is 4. The minimum is 1 and the maximum is 8.
encryption
Use connections that are secured with Secure Sockets Layer (SSL). Default value is true. The values true, t, on, yes, and y (case insensitive) are treated as true. Any other value is treated as false.
low_speed_limit
The download speed lower limit, in bytes per second. The default speed is 10240 (10K). If the download speed is slower than the limit for longer than the time specified by low_speed_time, the download connection is aborted and retried. After 3 retries, the s3 protocol returns an error. A value of 0 specifies no lower limit.
low_speed_time
When the connection speed is less than low_speed_limit, the amount of time, in minutes, to wait before aborting a download from an S3 bucket. The default is 1 minute. A value of 0 specifies no time limit.
Note: You must ensure that there is sufficient memory on the Greenplum Database segment hosts when the s3 protocol accesses the files. Greenplum Database allocates threadnum * chunksize memory on each segment host when accessing S3 files.
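The limits and the memory note above can be checked with a short Python sketch. It validates threadnum and chunksize against the documented ranges, interprets the encryption value as described, and computes the product of the concurrent connection count (threadnum) and chunksize. All function names are hypothetical, and the figure is only the product named in the note, not a measurement of actual memory use.

```python
MB = 1024 * 1024
TRUE_VALUES = {"true", "t", "on", "yes", "y"}


def parse_encryption(value):
    """true, t, on, yes, and y (case insensitive) are true; anything else is false."""
    return value.lower() in TRUE_VALUES


def check_limits(threadnum=4, chunksize=64 * MB):
    """Raise ValueError when a value is outside the documented range."""
    if not 1 <= threadnum <= 8:
        raise ValueError("threadnum must be between 1 and 8")
    if not 8 * MB <= chunksize <= 128 * MB:
        raise ValueError("chunksize must be between 8 MB and 128 MB")


def estimated_buffer_bytes(threadnum=4, chunksize=64 * MB):
    """Return threadnum * chunksize after validating both values."""
    check_limits(threadnum, chunksize)
    return threadnum * chunksize


if __name__ == "__main__":
    # Defaults: 4 connections with 64 MB buffers.
    print(estimated_buffer_bytes() // MB, "MB")
    print(parse_encryption("Yes"), parse_encryption("1"))
```

With the default values the product is 256 MB; with threadnum = 3 and chunksize = 67108864 it is 192 MB.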

s3 Protocol Limitations

These are s3 protocol limitations:
  • Only the S3 path-style URL is supported.
    s3://S3_endpoint/bucketname/[S3_prefix]
  • Only the S3 endpoint is supported. The protocol does not support virtual hosting of S3 buckets (binding a domain name to an S3 bucket).
  • The AWS signature version 2 and version 4 signing processes are supported.

    For information about the S3 endpoints supported by each signing process, see http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region.

  • S3 encryption is not supported. The S3 file property Server Side Encryption must be None.

About the gpcheckcloud Utility

The Greenplum Database utility gpcheckcloud helps users create an s3 protocol configuration file and test a configuration file. You can specify options to test the ability to access an S3 bucket with a configuration file, and optionally download data from files in the bucket.

If you run the utility with the -t option, it sends a template configuration file to STDOUT. You can capture the output and create an s3 configuration file to connect to Amazon S3.

The utility is installed in the Greenplum Database $GPHOME/bin directory.

Syntax
gpcheckcloud {-c | -d} "s3://S3_endpoint/bucketname/[S3_prefix] [config=path_to_config_file]"

gpcheckcloud -t

gpcheckcloud -h
Options
-c
Connect to the specified S3 location with the configuration specified in the s3 protocol URL and return information about the files in the S3 location.
If the connection fails, the utility displays information about failures such as invalid credentials, prefix, or server address (DNS error), or server not available.
-d
Download data from the specified S3 location with the configuration specified in the s3 protocol URL and send the output to STDOUT.
If files are gzip compressed, the uncompressed data is sent to STDOUT.
-t
Sends a template configuration file to STDOUT. You can capture the output and create an s3 configuration file to connect to Amazon S3.
-h
Display gpcheckcloud help.

Examples

This example runs the utility with the -t option to create a template s3 configuration file mytest_s3.config in the current directory.
gpcheckcloud -t > ./mytest_s3.config
This example attempts to connect to an S3 bucket location with the s3 configuration file s3.mytestconf.
gpcheckcloud -c "s3://domain/test1/abc config=s3.mytestconf"
This example downloads all files from the S3 bucket location and sends the output to STDOUT.
gpcheckcloud -d "s3://domain/test1/abc config=s3.mytestconf"
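When gpcheckcloud is driven from a script, the URL and the config setting must be passed together as one argument, which is why the examples above quote them. The following Python sketch builds such an argument list. The helper name gpcheckcloud_args is hypothetical, and the sketch only constructs and prints the command; it does not run the utility.

```python
import shlex


def gpcheckcloud_args(mode, s3_url, config):
    """Build the argument list for a connection test (-c) or a download (-d)."""
    if mode not in ("-c", "-d"):
        raise ValueError("mode must be -c or -d")
    # The URL and config=... form a single argument separated by a space.
    return ["gpcheckcloud", mode, f"{s3_url} config={config}"]


if __name__ == "__main__":
    args = gpcheckcloud_args("-c", "s3://domain/test1/abc", "s3.mytestconf")
    # shlex.join shows the command with shell quoting applied.
    print(shlex.join(args))
```

On a host where the utility is installed, the list could be passed to subprocess.run(args, check=True) to execute the check.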