Backup and Restore Overview
Backup and Restore Overview
Greenplum Database supports parallel and non-parallel methods for backing up and restoring databases. Parallel operations scale regardless of the number of segments in your system, because segment hosts each write their data to local disk storage simultaneously. With non-parallel backup and restore operations, the data must be sent over the network from the segments to the master, which writes all of the data to its storage. In addition to restricting I/O to one host, non-parallel backup requires that the master have sufficient local disk storage to store the entire database.
Parallel Backup and Restore
The Greenplum Database parallel dump utility gpcrondump backs up the Greenplum master instance and each active segment instance at the same time.
By default, gpcrondump creates dump files in the gp_dump subdirectory of each segment instance. On the master, gpcrondump creates several dump files, containing database information such as DDL statements, the system catalog tables, and metadata files. On each segment, gpcrondump creates one dump file, which contains commands to recreate the data on that segment. Each file created for a backup begins with a 14-digit timestamp key that identifies the backup set the file belongs to.
The gpdbrestore parallel restore utility takes the timestamp key generated by gpcrondump, validates the backup set, and restores the database objects and data into a distributed database. Parallel restore operations require a complete backup set created by gpcrondump, a full backup, and any required incremental backups. As the following figure illustrates, all segments restore data from local backup files simultaneously.
The gpdbrestore utility provides flexibility and verification options for use with the automated backup files produced by gpcrondump or with backup files moved from the Greenplum cluster to an alternate location. See Restoring Greenplum Databases. gpdbrestore can also be used to copy files to the alternate location.
Non-Parallel Backup and Restore
The PostgreSQL pg_dump and pg_dumpall non-parallel backup utilities can be used to create a single dump file on the master host that contains all data from all active segments.
The PostgreSQL non-parallel utilities should be used only for special cases. They are much slower than using the Greenplum backup utilities since all of the data must pass through the master. Additionally, it is often the case that the master host has insufficient disk space to save a backup of an entire distributed Greenplum database.
The pg_restore utility requires compressed dump files created by pg_dump or pg_dumpall. Before starting the restore, you should modify the CREATE TABLE statements in the dump files to include the Greenplum DISTRIBUTED clause. If you do not include the DISTRIBUTED clause, Greenplum Database assigns default values, which may not be optimal. For details, see CREATE TABLE in the Greenplum Database Reference Guide.
To perform a non-parallel restore using parallel backup files, you can copy the backup files from each segment host to the master host, and then load them through the master. See Restoring to a Different Greenplum System Configuration.
Another non-parallel method for backing up Greenplum Database data is to use the COPY TO SQL command to copy all or a portion of a table out of the database to a delimited text file on the master host.
Backup and Restore Options
The Greenplum Database backup and restore utilities support various locations for backup files:
- With the gpcrondump utility, backup files may be saved in the default location, the db_dumps subdirectory of the master and each segment, or saved to a different directory specified with the gpcrondump -u option.
- Both the gpcrondump and gpdbrestore utilities have integrated support for EMC Data Domain Boost and Symantec NetBackup systems.
- Backup files can be saved through named pipes to any network accessible location.
- Backup files saved to the default location may be moved to an archive server on the network. This allows performing the backup at the highest transfer rates (when segments write the backup data to fast local disk arrays) and then freeing up disk space by moving the files to remote storage.
- You can backup tables belonging to one or more schema you specify on the command line or in a text file.
- You can specify schema to exclude from the backup, as command-line options or in a list provided in a text file.
- You can backup a specified set of tables listed on the command line or in a text file. The table and schema options cannot be used together in a single backup.
- In addition to database objects, gpcrondump can backup the configuration files pg_hba.conf, pg_ident.conf, and postgresql.conf, and global database objects, such as roles and tablespaces.
You can create incremental backups:
- An incremental backup contains only append-optimized and column-oriented tables that have changed since the most recent incremental or full backup.
- For partitioned append-optimized tables, only changed append-optimized/column-oriented table partitions are backed up.
- Incremental backups include all heap tables.
- Use the gpchrondump --incremental flag to specify an incremental backup.
- Restoring an incremental backup requires a full backup and all subsequent incremental backups, up to the backup you are restoring.
- By default, gpdbrestore restores data to the database it was backed up from.
- The --redirect flag allows you to restore a backup to a different database.
- The restored database can be dropped and recreated, but the default is to restore into an existing database.
- Selected tables can be restored from a backup by listing the tables on the command line or by listing them in a text file and specifying the text file on the command line.
- You can restore a database from backup files moved to an archive server. The backup files are copied back into place on the master host and each segment host and then restored to the database.