Parallel Backup with gpbackup and gprestore


Note: gpbackup and gprestore are experimental utilities and are not intended for use in a production environment. Experimental features are subject to change without notice in future releases.

gpbackup and gprestore are new utilities that provide an improved way of creating and restoring backup sets for Greenplum Database. By default, gpbackup stores only the object metadata files and DDL files for a backup in the Greenplum Database master data directory. Greenplum Database segments use the COPY ... ON SEGMENT command to store their data for backed-up tables in compressed CSV data files, located in each segment's backups directory.

The backup metadata files contain all of the information that gprestore needs to restore a full backup set in parallel. Backup metadata also provides the framework for restoring only individual objects in the data set, along with any dependent objects, in future versions of gprestore. (See Understanding Backup Files for more information.) Storing the table data in CSV files also provides opportunities for using other restore utilities, such as gpload, to load the data either in the same cluster or another cluster. By default, one file is created for each table on the segment. You can specify the -leaf-partition-data option with gpbackup to create one data file per leaf partition of a partitioned table, instead of a single file. This option also enables you to filter backup sets by leaf partitions.
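Because the segment backup files are ordinary compressed CSV, you can inspect them with standard command-line tools. The file below is a stand-in created for illustration; on a real system you would point gunzip at an actual file in a segment's backups directory:

```shell
# Create a small stand-in for a segment data file so these commands are
# runnable anywhere (real files live under
# <seg_dir>/backups/<YYYYMMDD>/<YYYYMMDDHHMMSS>/ on each segment host).
printf '1,2017-10-05,19.99\n2,2017-10-06,4.50\n' > sample_table.csv
gzip -c sample_table.csv > gpbackup_0_20171103152558_16524.gz

# Preview the backed-up rows for the table:
gunzip -c gpbackup_0_20171103152558_16524.gz | head

# Count the rows that were written for the table:
gunzip -c gpbackup_0_20171103152558_16524.gz | wc -l
```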

Each gpbackup task uses a single transaction in Greenplum Database. During this transaction, metadata is backed up on the master host, and data for each table on each segment host is written to CSV backup files using COPY ... ON SEGMENT commands in parallel. The backup process acquires an ACCESS SHARE lock on each table that is backed up.

Requirements and Limitations

You can use gpbackup and gprestore on Greenplum Database systems that support the COPY ... ON SEGMENT command (Greenplum Database 5.1.0 and later, or 4.3.17.0 and later).

gpbackup and gprestore are experimental features in this release, and have the following limitations:
  • If you create an index on a parent partitioned table, gpbackup does not back up that index on the parent's child partitions, because creating the same index on a child would cause an error. However, if you exchange a partition, gpbackup does not detect that the index on the exchanged partition is inherited from the new parent table. In this case, gpbackup backs up conflicting CREATE INDEX statements, which causes an error when you restore the backup set.
  • You can execute multiple instances of gpbackup, but each execution requires a distinct timestamp.
  • Database object filtering is currently limited to schemas and tables.
  • If you use the gpbackup -single-data-file option to combine table backups into a single file per segment, you cannot perform a parallel restore operation with gprestore (cannot set -jobs to a value higher than 1).
  • You cannot use the -exclude-table-file option with -leaf-partition-data. Although you can specify leaf partition names in a file specified with -exclude-table-file, gpbackup ignores the partition names.
  • Incremental backups are not supported.

Objects Included in a Backup or Restore

The following table lists the objects that are backed up and restored with gpbackup and gprestore. Database objects are backed up for the database you specify with the -dbname option. Global objects (Greenplum Database system objects) are also backed up by default, but they are restored only if you include the -globals option to gprestore.
Table 1. Objects that are backed up and restored
Database objects (backed up for the database specified with -dbname):
  • Session-level configuration parameter settings (GUCs)
  • Schemas
  • Procedural language extensions
  • Sequences
  • Comments
  • Tables
  • Owners
  • Writable External Tables (DDL only)
  • Readable External Tables (DDL only)
  • Functions
  • Aggregates
  • Casts
  • Types
  • Views
  • Protocols
  • Triggers. (While Greenplum Database does not support triggers, any trigger definitions that are present are backed up and restored.)
  • Rules
  • Domains
  • Operators, operator families, and operator classes
  • Conversions
  • Text search parsers, dictionaries, templates, and configurations

Global objects (restored only if you include the -globals option to gprestore):
  • Tablespaces
  • Databases
  • Database-wide configuration parameter settings (GUCs)
  • Resource group definitions
  • Resource queue definitions
  • Roles
  • GRANT assignments of roles to databases

See also Understanding Backup Files.

Performing Basic Backup and Restore Operations

To perform a complete backup of a database, as well as Greenplum Database system metadata, use the command:
$ gpbackup -dbname <database_name>
For example:
$ gpbackup -dbname demo
20171103:15:25:58 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Starting backup of database demo
20171103:15:25:58 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Backup Timestamp = 20171103152558
20171103:15:25:58 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Backup Database = demo
20171103:15:25:58 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Backup Type = Unfiltered Compressed Full Backup
20171103:15:25:58 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Acquiring ACCESS SHARE locks on tables
20171103:15:25:58 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Locks acquired
20171103:15:25:58 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Gathering table metadata
20171103:15:25:58 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Writing global database metadata to /gpmaster/gpsne-1/backups/20171103/20171103152558/gpbackup_20171103152558_global.sql
20171103:15:25:59 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Global database metadata backup complete
20171103:15:25:59 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Writing pre-data metadata to /gpmaster/gpsne-1/backups/20171103/20171103152558/gpbackup_20171103152558_predata.sql
20171103:15:25:59 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Pre-data metadata backup complete
20171103:15:25:59 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Writing post-data metadata to /gpmaster/gpsne-1/backups/20171103/20171103152558/gpbackup_20171103152558_postdata.sql
20171103:15:25:59 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Post-data metadata backup complete
20171103:15:25:59 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Writing data to file
Tables backed up:  2 / 2 [==============================================================] 100.00% 0s
20171103:15:25:59 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Data backup complete
20171103:15:25:59 gpbackup:gpadmin:0ee2f5fb02c9:017527-[WARNING]:-Found neither /usr/local/greenplum-db/./bin/mail_contacts nor /home/gpadmin/mail_contacts
20171103:15:25:59 gpbackup:gpadmin:0ee2f5fb02c9:017527-[WARNING]:-Unable to send backup email notification
20171103:15:26:00 gpbackup:gpadmin:0ee2f5fb02c9:017527-[INFO]:-Backup completed successfully
The above command creates global and database-specific metadata files on the Greenplum Database master host in the default directory, $MASTER_DATA_DIRECTORY/backups/<YYYYMMDD>/<YYYYMMDDHHMMSS>/. For example:
$ ls /gpmaster/gpsne-1/backups/20171103/20171103152558/
gpbackup_20171103152558_config.yaml  gpbackup_20171103152558_postdata.sql  gpbackup_20171103152558_report
gpbackup_20171103152558_global.sql   gpbackup_20171103152558_predata.sql   gpbackup_20171103152558_toc.yaml
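The directory name follows directly from the backup timestamp, so you can reconstruct the path in a script. The MASTER_DATA_DIRECTORY value below is an example, not a requirement:

```shell
# Reconstruct the default metadata directory from a backup timestamp.
MASTER_DATA_DIRECTORY=/gpmaster/gpsne-1   # example path; substitute your own
TIMESTAMP=20171103152558                  # YYYYMMDDHHMMSS from the backup log

DATE_PART=${TIMESTAMP%??????}             # strip HHMMSS, leaving YYYYMMDD
echo "$MASTER_DATA_DIRECTORY/backups/$DATE_PART/$TIMESTAMP/"
# prints /gpmaster/gpsne-1/backups/20171103/20171103152558/
```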
By default, each segment stores each table's data for the backup in a separate compressed CSV file in <seg_dir>/backups/<YYYYMMDD>/<YYYYMMDDHHMMSS>/:
$ ls /gpdata1/gpsne0/backups/20171103/20171103152558/
gpbackup_0_20171103152558_16524.gz  gpbackup_0_20171103152558_16543.gz
To consolidate all backup files into a single directory, include the -backupdir option. Note that you must specify an absolute path with this option:
$ gpbackup -dbname demo -backupdir /home/gpadmin/backups
20171103:15:31:56 gpbackup:gpadmin:0ee2f5fb02c9:017586-[INFO]:-Starting backup of database demo
...
20171103:15:31:58 gpbackup:gpadmin:0ee2f5fb02c9:017586-[INFO]:-Backup completed successfully
$ find /home/gpadmin/backups/ -type f
/home/gpadmin/backups/gpseg0/backups/20171103/20171103153156/gpbackup_0_20171103153156_16543.gz
/home/gpadmin/backups/gpseg0/backups/20171103/20171103153156/gpbackup_0_20171103153156_16524.gz
/home/gpadmin/backups/gpseg1/backups/20171103/20171103153156/gpbackup_1_20171103153156_16543.gz
/home/gpadmin/backups/gpseg1/backups/20171103/20171103153156/gpbackup_1_20171103153156_16524.gz
/home/gpadmin/backups/gpseg-1/backups/20171103/20171103153156/gpbackup_20171103153156_config.yaml
/home/gpadmin/backups/gpseg-1/backups/20171103/20171103153156/gpbackup_20171103153156_predata.sql
/home/gpadmin/backups/gpseg-1/backups/20171103/20171103153156/gpbackup_20171103153156_global.sql
/home/gpadmin/backups/gpseg-1/backups/20171103/20171103153156/gpbackup_20171103153156_postdata.sql
/home/gpadmin/backups/gpseg-1/backups/20171103/20171103153156/gpbackup_20171103153156_report
/home/gpadmin/backups/gpseg-1/backups/20171103/20171103153156/gpbackup_20171103153156_toc.yaml

Restoring from Backup

To use gprestore to restore from a backup set, you must use the -timestamp option to specify the exact timestamp value (YYYYMMDDHHMMSS) to restore. Include the -createdb option if the database does not exist in the cluster. For example:
$ dropdb demo
$ gprestore -timestamp 20171103152558 -createdb
20171103:15:45:30 gprestore:gpadmin:0ee2f5fb02c9:017714-[INFO]:-Restore Key = 20171103152558
20171103:15:45:31 gprestore:gpadmin:0ee2f5fb02c9:017714-[INFO]:-Creating database
20171103:15:45:44 gprestore:gpadmin:0ee2f5fb02c9:017714-[INFO]:-Database creation complete
20171103:15:45:44 gprestore:gpadmin:0ee2f5fb02c9:017714-[INFO]:-Restoring pre-data metadata from /gpmaster/gpsne-1/backups/20171103/20171103152558/gpbackup_20171103152558_predata.sql
20171103:15:45:45 gprestore:gpadmin:0ee2f5fb02c9:017714-[INFO]:-Pre-data metadata restore complete
20171103:15:45:45 gprestore:gpadmin:0ee2f5fb02c9:017714-[INFO]:-Restoring data
20171103:15:45:45 gprestore:gpadmin:0ee2f5fb02c9:017714-[INFO]:-Data restore complete
20171103:15:45:45 gprestore:gpadmin:0ee2f5fb02c9:017714-[INFO]:-Restoring post-data metadata from /gpmaster/gpsne-1/backups/20171103/20171103152558/gpbackup_20171103152558_postdata.sql
20171103:15:45:45 gprestore:gpadmin:0ee2f5fb02c9:017714-[INFO]:-Post-data metadata restore complete
If you specified a custom -backupdir to consolidate the backup files, include the same -backupdir option when using gprestore to locate the backup files:
$ dropdb demo
$ gprestore -backupdir /home/gpadmin/backups/ -timestamp 20171103153156 -createdb
20171103:15:51:02 gprestore:gpadmin:0ee2f5fb02c9:017819-[INFO]:-Restore Key = 20171103153156
...
20171103:15:51:17 gprestore:gpadmin:0ee2f5fb02c9:017819-[INFO]:-Post-data metadata restore complete

By default, gprestore does not restore global metadata for the Greenplum Database system. To restore global objects, include the -globals option.

By default, gprestore uses a single connection to restore table data. If you have a large backup set, you can improve restore performance by increasing the number of parallel connections with the -jobs option. For example:
$ gprestore -backupdir /home/gpadmin/backups/ -timestamp 20171103153156 -createdb -jobs 8

Test the number of parallel connections with your backup set to determine the ideal number for fast data recovery.
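One way to run that test is to script trial restores at increasing -jobs values and time each one. The sketch below is a dry run that only prints the commands it would execute (the echo makes it safe to run anywhere); the backup directory and timestamp are the example values from above, and the dropdb/createdb housekeeping between runs is omitted:

```shell
# Dry-run sketch: print timed gprestore invocations at several -jobs
# settings. Remove 'echo' to actually execute each restore against a
# disposable copy of the database.
BACKUP_DIR=/home/gpadmin/backups    # example value
TIMESTAMP=20171103153156            # example value

for jobs in 1 2 4 8; do
  echo time gprestore -backupdir "$BACKUP_DIR" -timestamp "$TIMESTAMP" -createdb -jobs "$jobs"
done
```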

Filtering the Contents of a Backup or Restore

gpbackup backs up all schemas and tables in the specified database, unless you exclude or include individual schema or table objects. You can filter the database schemas that are backed up by using either the -include-schema or -exclude-schema command-line options to gpbackup. For example, if the "demo" database includes only two schemas, "wikipedia" and "twitter," both of the following commands back up only the "wikipedia" schema:
$ gpbackup -dbname demo -include-schema wikipedia
$ gpbackup -dbname demo -exclude-schema twitter
You can specify multiple -include-schema options or multiple -exclude-schema options in a gpbackup command. For example:
$ gpbackup -dbname demo -include-schema wikipedia -include-schema twitter
To filter the individual tables that are included in a backup set, create a list of qualified table names in a text file that you use to include or exclude tables from a backup. Each line in the text file must define a single table using the format <schema-name>.<table-name>. The file must not include trailing blank lines. For example:
wikipedia.articles
twitter.message
If a table or schema name contains any character other than a lowercase letter, digit, or underscore, you must enclose that name in double quotes. For example:
beer."IPA"
"Wine".riesling
"Wine"."sauvignon blanc"
water.tonic
After creating the file, you can use it either to include or exclude tables with the gpbackup options -include-table-file and -exclude-table-file. For example:
$ gpbackup -dbname demo -include-table-file /home/gpadmin/table-list.txt
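Before running the backup, you can sanity-check the list file against the required format. The validation pattern below is an illustration, not part of gpbackup; it accepts plain names built from lowercase letters, digits, and underscores, as well as double-quoted names:

```shell
# Write a filter file in the required <schema>.<table> format
# (no trailing blank lines):
printf 'wikipedia.articles\ntwitter.message\n' > /tmp/table-list.txt

# Flag any line that is not <name>.<name>, where each <name> is either
# lowercase letters/digits/underscores or a double-quoted string:
if grep -Ev '^("[^"]+"|[a-z0-9_]+)\.("[^"]+"|[a-z0-9_]+)$' /tmp/table-list.txt; then
  echo "malformed entries found above"
else
  echo "table list OK"
fi
```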

Filtering by Leaf Partition

By default, gpbackup creates one file for each table on a segment. You can specify the -leaf-partition-data option to create one data file per leaf partition of a partitioned table, instead of a single file for the entire table. This option also enables you to filter backups to specific leaf partitions by listing the partition names in a text file to include. For example, consider a table that was created using the statement:
demo=# CREATE TABLE sales (id int, date date, amt decimal(10,2))
DISTRIBUTED BY (id)
PARTITION BY RANGE (date)
( PARTITION Jan17 START (date '2017-01-01') INCLUSIVE ,
PARTITION Feb17 START (date '2017-02-01') INCLUSIVE ,
PARTITION Mar17 START (date '2017-03-01') INCLUSIVE ,
PARTITION Apr17 START (date '2017-04-01') INCLUSIVE ,
PARTITION May17 START (date '2017-05-01') INCLUSIVE ,
PARTITION Jun17 START (date '2017-06-01') INCLUSIVE ,
PARTITION Jul17 START (date '2017-07-01') INCLUSIVE ,
PARTITION Aug17 START (date '2017-08-01') INCLUSIVE ,
PARTITION Sep17 START (date '2017-09-01') INCLUSIVE ,
PARTITION Oct17 START (date '2017-10-01') INCLUSIVE ,
PARTITION Nov17 START (date '2017-11-01') INCLUSIVE ,
PARTITION Dec17 START (date '2017-12-01') INCLUSIVE
END (date '2018-01-01') EXCLUSIVE );
NOTICE:  CREATE TABLE will create partition "sales_1_prt_jan17" for table "sales"
NOTICE:  CREATE TABLE will create partition "sales_1_prt_feb17" for table "sales"
NOTICE:  CREATE TABLE will create partition "sales_1_prt_mar17" for table "sales"
NOTICE:  CREATE TABLE will create partition "sales_1_prt_apr17" for table "sales"
NOTICE:  CREATE TABLE will create partition "sales_1_prt_may17" for table "sales"
NOTICE:  CREATE TABLE will create partition "sales_1_prt_jun17" for table "sales"
NOTICE:  CREATE TABLE will create partition "sales_1_prt_jul17" for table "sales"
NOTICE:  CREATE TABLE will create partition "sales_1_prt_aug17" for table "sales"
NOTICE:  CREATE TABLE will create partition "sales_1_prt_sep17" for table "sales"
NOTICE:  CREATE TABLE will create partition "sales_1_prt_oct17" for table "sales"
NOTICE:  CREATE TABLE will create partition "sales_1_prt_nov17" for table "sales"
NOTICE:  CREATE TABLE will create partition "sales_1_prt_dec17" for table "sales"
CREATE TABLE
To back up only data for the last quarter of the year, first create a text file that lists those leaf partition names instead of the full table name:
public.sales_1_prt_oct17
public.sales_1_prt_nov17
public.sales_1_prt_dec17
Then specify the file with the -include-table-file option, and also specify -leaf-partition-data to generate one data file per leaf partition:
$ gpbackup -dbname demo -include-table-file last-quarter.txt -leaf-partition-data
Note: You cannot use the -exclude-table-file option with -leaf-partition-data. Although you can specify leaf partition names in a file specified with -exclude-table-file, gpbackup ignores the partition names.

Filtering with gprestore

After creating a backup set with gpbackup, you can filter the schemas and tables that you want to restore from the backup set using the gprestore -include-schema and -include-table-file options. These options work in the same way as their gpbackup counterparts, but have the following restrictions:
  • The schemas or tables that you attempt to restore must not already exist in the database.
  • If you attempt to restore a schema or table that does not exist in the backup set, gprestore does not execute.
  • Backup sets that use a single data file for all tables on a segment (gpbackup -single-data-file option) cannot be restored using the -include-schema option.
  • If you use the -include-schema option, gprestore cannot restore objects that have dependencies on multiple schemas.
  • If you use the -include-table-file option, gprestore does not restore indexes, create roles, or set the owner of the tables.
  • The file that you specify with -include-table-file cannot include a leaf partition name, as it can when you specify this option with gpbackup.

Configuring Email Notifications

gpbackup sends a status email notification after a backup operation completes if you place a file named mail_contacts in the home directory of the Greenplum Database superuser (gpadmin) or in the same directory as the gpbackup utility ($GPHOME/bin).

This file must contain one email address per line. gpbackup issues a warning if it cannot locate a mail_contacts file in either location. If both locations have a mail_contacts file, then the one in $HOME takes precedence.
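The lookup order can be sketched as a small shell function. This mimics the documented precedence rather than reproducing gpbackup's internal logic, and the /tmp directories stand in for $HOME and $GPHOME/bin:

```shell
# Sketch of the documented mail_contacts lookup order (assumption:
# this mirrors the described precedence; it is not gpbackup's own code).
find_mail_contacts() {
  home_dir=$1; util_dir=$2
  if [ -f "$home_dir/mail_contacts" ]; then
    echo "$home_dir/mail_contacts"        # the $HOME copy takes precedence
  elif [ -f "$util_dir/mail_contacts" ]; then
    echo "$util_dir/mail_contacts"        # fall back to $GPHOME/bin
  else
    echo "warning: no mail_contacts file found" >&2
    return 1
  fi
}

# Example: only the utility directory has the file, so it is chosen.
mkdir -p /tmp/demo_home /tmp/demo_bin
printf 'dba@example.com\n' > /tmp/demo_bin/mail_contacts
find_mail_contacts /tmp/demo_home /tmp/demo_bin
```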

Note: The UNIX mail utility must be running on the Greenplum Database host and must be configured to allow the Greenplum superuser (gpadmin) to send email.

Understanding Backup Files

Warning: All gpbackup metadata files are created with read-only permissions. Never delete or modify the metadata files for a gpbackup backup set. Doing so will render the backup files non-functional.

A complete backup set for gpbackup includes multiple metadata files, supporting files, and CSV data files, each designated with the timestamp at which the backup was created.

By default, metadata and supporting files are stored on the Greenplum Database master host in the directory $MASTER_DATA_DIRECTORY/backups/YYYYMMDD/YYYYMMDDHHMMSS/. If you specify a custom backup directory, this same file path is created as a subdirectory of the backup directory. The following table describes the names and contents of the metadata and supporting files.
Table 2. gpbackup Metadata Files (master)
File name Description
gpbackup_<YYYYMMDDHHMMSS>_global.sql Contains DDL for objects that are global to the Greenplum Cluster, and not owned by a specific database within the cluster. These objects include:
  • Tablespaces
  • Databases
  • Database-wide configuration parameter settings (GUCs)
  • Resource group definitions
  • Resource queue definitions
  • Roles
  • GRANT assignments of roles to databases

Note that global metadata is not restored by default. You must include the -globals option to gprestore to restore global metadata.

gpbackup_<YYYYMMDDHHMMSS>_predata.sql Contains DDL for objects in the backed-up database (specified with -dbname) that must be created prior to restoring the actual data. These objects include:
  • Session-level configuration parameter settings (GUCs)
  • Schemas
  • Procedural language extensions
  • Types
  • Sequences
  • Functions
  • Tables
  • Protocols
  • Operators and operator classes
  • Conversions
  • Aggregates
  • Casts
  • Views
  • Constraints
gpbackup_<YYYYMMDDHHMMSS>_postdata.sql Contains DDL for objects in the backed-up database (specified with -dbname) that must be created after restoring the data. These objects include:
  • Indexes
  • Rules
  • Triggers. (While Greenplum Database does not support triggers, any trigger definitions that are present are backed up and restored.)
gpbackup_<YYYYMMDDHHMMSS>_toc.yaml Contains metadata for locating object DDL in the _predata.sql and _postdata.sql files. This file also contains the table names and OIDs used for locating the corresponding table data in CSV data files that are created on each segment. See Segment Data Files.
gpbackup_<YYYYMMDDHHMMSS>_report Contains information about the backup operation that is used to populate the email notice (if configured) that is sent after the backup completes. This file contains information such as:
  • Command-line options that were provided
  • Database that was backed up
  • Database version
  • Backup type
See Configuring Email Notifications.
gpbackup_<YYYYMMDDHHMMSS>_config.yaml Contains metadata about the execution of the particular backup task, including:
  • gpbackup version
  • Database name
  • Greenplum Database version
  • Additional option settings such as -no-compression, -compression-level, -metadata-only, -data-only, and -with-stats.

Segment Data Files

By default, each segment creates one compressed CSV file for each table that is backed up on the segment. You can optionally specify the -single-data-file option to create a single data file on each segment. The files are stored in <seg_dir>/backups/YYYYMMDD/YYYYMMDDHHMMSS/.

If you specify a custom backup directory, segment data files are copied to this same file path as a subdirectory of the backup directory. If you include the -leaf-partition-data option, gpbackup creates one data file for each leaf partition of a partitioned table, instead of one file per table.

Each data file uses the file name format gpbackup_<content_id>_<YYYYMMDDHHMMSS>_<oid>.gz where:
  • <content_id> is the content ID of the segment.
  • <YYYYMMDDHHMMSS> is the timestamp of the gpbackup operation.
  • <oid> is the object ID of the table. The metadata file gpbackup_<YYYYMMDDHHMMSS>_toc.yaml references this <oid> to locate the data for a specific table in a schema.
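A data file name can be split into these components with plain shell parameter expansion; the file name used here is illustrative:

```shell
# Split gpbackup_<content_id>_<YYYYMMDDHHMMSS>_<oid>.gz into its parts.
file=gpbackup_0_20171103152558_16524.gz

base=${file%.gz}                   # drop the .gz suffix
oid=${base##*_}                    # last field: table OID
rest=${base%_*}                    # everything before the OID
timestamp=${rest##*_}              # middle field: backup timestamp
content_id=${rest%_*}              # gpbackup_<content_id>
content_id=${content_id##*_}       # segment content ID

echo "segment=$content_id timestamp=$timestamp oid=$oid"
# prints segment=0 timestamp=20171103152558 oid=16524
```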

You can optionally specify the gzip compression level (from 1-9) using the -compression-level option, or disable compression entirely with -no-compression. If you do not specify a compression level, gpbackup uses compression level 1 by default.
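The effect of the compression level can be demonstrated with gzip directly; this is standard gzip behavior rather than anything gpbackup-specific. Higher levels typically produce smaller files at a higher CPU cost:

```shell
# Compare gzip level 1 (gpbackup's default) with level 9 on sample data.
seq 1 10000 > sample_data.csv

gzip -1 -c sample_data.csv > sample_data.l1.gz
gzip -9 -c sample_data.csv > sample_data.l9.gz

ls -l sample_data.l1.gz sample_data.l9.gz   # compare the file sizes

# Either file decompresses back to the original data:
gunzip -c sample_data.l9.gz | cmp - sample_data.csv && echo "round-trip OK"
```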