gpcopy

gpcopy

The gpcopy utility copies objects from databases in a source Greenplum Database system to databases in a destination Greenplum Database system.

Note: The gpcopy utility is available only in the commercial release of Pivotal Greenplum Database.

Synopsis

gpcopy
   { --full |
     { --dbname database1[, database2 ... ] 
        [--dest-dbname dest-db1[, dest-db2 ... ]] } |
     --include-table db.schema.table[, db.schema1.table1 ... ] |
     --include-table-file table-file1
        [ --include-table-file table-file2] ... ] }
   [--exclude-table db.schema.table[, db.schema1.table1 ... ]]
   [--exclude-table-file table-file1]
      [ --exclude-table-file table-file2] ... ]]
   [--skip-existing | --truncate | --drop | --append ]
   [--parallelize-leaf-partitions] 
   [--dry-run]
   [--analyze]
   [--validate type]
   [--no-distribution-check]
   [--truncate-source-after [--yes]]
   [--metadata-only]
   { --dest-host dest_host [--dest-port dest_port]
      [ --dest-user dest_user] }
   [--source-host source_host [--source-port source_port]
   [--source-user source_user]]
   [--no-compression]
   [--on-segment-threshold int]
   [--jobs int]
   [--quiet | --debug]

gpcopy --version
gpcopy --help

Description

The gpcopy utility copies database objects from a source Greenplum Database system to a destination system. You can perform one of the following types of copy operations:

  • Copy a Greenplum Database system with the --full option. This option copies all database objects including, tables, table data, indexes, views, users, roles, functions, and resource queues for all user-defined databases to a different destination system.
  • Copy a set of user-defined database tables to a destination system.
    • The --dbname option copies all user-defined tables, table data, and re-creates the table indexes from specified databases.
    • The --include-table or --include-table-file option copies a specified set of user-defined tables, table data, and re-creates the table indexes.
    • The --exclude-table and --exclude-table-file options exclude a specified set of user-defined tables and table data to be copied.
  • Copy only the database schemas with the --metadata-only option.
When running gpcopy, you must specify the data to copy from the source database and how to manage data in the destination database.
  • You must use one and only one of these options to specify the data to be copied from the source database: --full, --dbname, --include-table, or --include-table-file.
  • You must use one of these options to specify how manage data in the destination database: --skip-existing, --truncate, --drop, or --append.

    If you specify both the --append and --validation options, validation of source table data fails if a destination table contains data.

If you specify the option --truncate-source-after, you must also specify the --validate option. When --truncate-source-after is specified, gpcopy truncates the source table after the table data is copied and destination table data has been validated.

Prerequisites

The user IDs connecting to the source and destination Greenplum Database systems, the --source-user and --dest-user, must have appropriate access to the systems.

The source and destination Greenplum Database segment hosts need to be able to communicate with each other. To ensure that the segment hosts can communicate, you can use a tool such as the Linux netperf utility.

When the --full option is specified, resource groups and table spaces are copied, however, the utility does not configure the destination system. You must configure the system to use resource groups and create filespaces for the tablespaces.

Options

--analyze
Run the ANALYZE command on non-system tables. The default is to not run the ANALYZE command. The operation is performed for each table after the table data is copied.
--append
Append data to the table in the destination database if it already exists.
At most, only one of the options can be specified: --skip-existing, --truncate, --drop, or --append.
--dry-run
When you specify this option, gpcopy generates a list of the migration operations that would have been performed with the specified options. The data is not migrated.
The information is displayed at the command line and written to the log file.
--jobs int
Specify the number processes that gpcopy runs in parallel. The default is 4. The range is from 1 to 64.
The option --jobs produces 2*n+1 database connections. The default value 4, creates 9 connections.
If you increase this option, ensure that Greenplum Database systems are configured with a sufficient maximum concurrent connection value to accommodate the gpcopy connections and other concurrent connections such as user connections. See the Greenplum Database server configuration parameter max_connections.
--dbname database
A source database to copy. To copy multiple databases to the destination system specify a comma-separated list of databases with no spaces between the names. All the user-defined tables and table data are copied to the destination system.
If the source database does not exist, gpcopy returns an error and quits. If a destination database does not exist a database is created.
Not valid with the --full, --include-table, or --include-table-file, options.
Alternatively, you can copy a set of tables with the --include-table, or --include-table-file option.
--debug
If specified, debug messages are displayed at the command prompt.
This option cannot be specified with the --quiet option.
--dest-dbname database
To copy a database to a different destination database, specify the name of the destination database. For multiple databases, specify a comma-separated list of databases with no spaces between the names. The number of database names must match the number of names specified in the --dbname option. The utility copies the source databases to the destination databases in the listed order. In this example, db1 is copied to destdb1, db2 is copied to destdb2, and db3 is copied to db3.
gpcopy --dest-host mdw-2 --dbname=db1,db2,db3 --dest-dbname=destdb1,destdb2,db3 --drop
If the source database does not exist, gpcopy returns an error and quits. If a destination database does not exist a database is created.
Valid only with the --dbname option.
--dest-host dest_host
Required. Destination Greenplum Database hostname or IP address.
--dest-port dest_port
Destination Greenplum Database port number, If not specified, the default is 5432.
--dest-user dest_user
User ID that is used to connect to the destination Greenplum Database system. If not specified, the default is the user gpadmin.
--drop
Specify this option to drop the table that is in the destination database if it already exists. Before copying table data, gpcopy drops the table and creates it again.
At most, only one of the options can be specified: --skip-existing, --truncate, --drop, or --append.
--exclude-table db.schema.table
A table from the source database system to exclude from transfer. The fully qualified table name must be specified (database.schema.table).
To exclude multiple tables, specify a comma-separated list of table names.
A set of tables can be specified using the Go language regular expression syntax. See the --include-table option for information about using regular expressions.
Only the specified tables are excluded, not dependent objects. You cannot specify views or system catalog tables.
This option must be specified with one of these options: --full, --dbname, --include-table, or --include-table-file. If the option --exclude-table results in no table to copy, the database or schema is not created in the destination system.
--exclude-table-file table-file
The location and name of file containing a list of fully qualified table names to exclude from copying to the destination system. In the text file, specify a single fully qualified table per line (database.schema.table). To specify multiple files, specify this option for each file.
--exclude-table-file <path_to_file1> --exclude-table-file <path_to_file2>
In the file, a set of tables can be specified using the Go language regular expression syntax. See the --include-table option for information about using regular expressions.
If a source table does not exist, gpcopy displays a warning.
Only the specified tables are excluded. You cannot specify views or system catalog tables.
This option must be specified with one of these options: --full, --dbname, --include-table, or --include-table-file. If the option --exclude-table results in no tables to copy, the database or schema is not created in the destination system.
--include-table db.schema.table
A table from the source database system to copy. The fully qualified table name must be specified (database.schema.table). You cannot specify views or system catalog tables. To copy multiple tables, specify a comma-separated list of table names.
A set of tables can be specified using the Go language regular expression syntax. You can use regular expressions in table names and schema names. Regular expressions are not supported for database names.
The regular expression pattern must be enclosed in slashes (/RE_pattern/). This example --include-table mytest.public.demo/.*/ specifies all tables that begin with demo in the database mytest in the public schema.
The following two examples for the --include-table option are equivalent. They both specify a set of tables that begins with demo and ends with zero or more digits.
--include-table testdb.schema1.demo/[0-9]*/
--include-table testdb.schema1./demo[0-9]*/
If the source table does not exist, gpcopy returns an error and quits.
If the destination table or database does not exist, it is created. Only the table and table data are copied, not dependant objects. Indexes are re-created only if the --drop option is specified. Dependent objects are not copied.
This option is not allowed with the options: --full, --dbname, or --include-table-file.
--include-table-file table-file
The location and name of file containing a list of fully qualified table names to copy from the Greenplum Database source system. To specify multiple files, specify this option for each file.
--include-table-file <path_to_file1> --include-table-file <path_to_file2>
In the text file, specify a single fully qualified table per line (database.schema.table). You cannot specify views or system catalog tables. In the file, a set of tables can be specified with regular expression syntax. See the --include-table option for information about using regular expressions.
This option cannot be specified with the --full, --dbname, or --include-table options.
--full
This option performs a migration of a Greenplum Database source system to a destination system.
A migration copies all database objects including, tables, indexes, views, roles, functions, user-defined types (UDT), resource queues, and resource groups for all user-defined databases. The default databases, postgres, template0, and template1, are not copied.
This option cannot be specified with the --dbname, --include-table, or --include-table-file options.
--no-compression
If specified, data is transferred without compression. By default, gpcopy compresses data during transfer from the source to the destination database when copying data to a different host.
The utility does not compress data when copying data to the same host.
--no-distribution-check
Specify this option to disable table data distribution checking. By default, gpcopy performs data distribution checking to ensure data is distributed to segment instances correctly. If distribution checking fails, the table copy fails.
The utility does not support table data distribution checking when copying a partitioned table that is defined with a leaf table that is an external table or if a leaf table is defined with a distribution policy that is different from the root partitioned table.
Warning: Before you perform a gpcopy operation with the --no-distribution-check option, ensure that you have a backup of the destination database and that the distribution policies of the tables that are being copied are the same in the source and destination database. Copying data into segment instances with incorrect data distribution can cause incorrect query results and can cause database corruption.
--on-segment-threshold int
Specifies the number of rows that determines when gpcopy copies a table using the Greenplum Database source and destination master instead of the source and destination segment instances. The default value is 10000 rows. If a table contains 10000 rows or less, the table is copied using the Greenplum Database master.
The value -1 disables copying tables using the master. All tables are copied using the segment instances.
For smaller tables, copying tables using the Greenplum Database master is more efficient than using segment instances.
--parallelize-leaf-partitions
If specified, the utility copies the leaf partition tables of a partitioned table in parallel. The default is to copy the partitioned table as a single table based on the root partition table.
If the --validate option is also specified, the utility validates each leaf partition table during the copy process and then validates the entire partitioned table.
--quiet
If specified, suppress status messages at the command prompt. The messages are sent only to the log file. Higher level messages such as warning and error messages are still displayed.
This option cannot be specified with the --debug option.
--metadata-only
Create only the schemas specified by the command. Data is not transferred.
If specified with the --full option, gpcopy replicates the complete database schema, including all tables, indexes, views, user-defined types (UDT), and user-defined functions (UDF) for the source databases. No data is transferred.
If you specify databases with the --dbname option or tables with the --include-table or --include-table-file options, gpcopy creates only the tables and indexes. No data is transferred.
This option cannot be used with --truncate option.
--skip-existing
Specify this option to skip copying a table from the source database if the table already exists in the destination database.
At most, only one of the options can be specified: --skip-existing, --truncate, --drop, or --append.
--source-host source_host
Source Greenplum Database host name or IP address. If not specified, the default host is the system running gpcopy (127.0.0.1).
--source-port source_port
Source Greenplum Database port number. If not specified, the default is 5432.
--source-user source_user
User ID that is used to connect to the source Greenplum Database system. If not specified, the default is the user gpadmin.
--truncate
Specify this option to truncate the table that is in the destination database if it already exists.
At most, only one of the options can be specified: --skip-existing, --truncate, --drop, or --append.
--truncate-source-after
Specify this option to truncate the table that is in the source database after gpcopy copies the table and validates the table data in the destination database.
If you specify this option, you must also specify the --validate option.
--yes
Optional. Automatic confirmation to truncate source table data after copying and validating table data. The prompt to truncate source tables does not appear. The default is to prompt to confirm truncating source table data after copying and validating the data.
--validate type
Perform data validation on the table data in the destination database after the table data is copied. These are the supported type of validation.
  • count - compares row counts between source and destination table data.
  • md5xor - calculates the MD5 value of all rows, then performs an XOR over the MD5 values
If you specify the --append option, and the destination table contains data, validation fails for the table.
If validation for a table fails, gpcopy rolls back the destination table.
--help
Displays the online help.
--version
Displays the version of this utility.

Notes

If a gpcopy command specifies an invalid option, or specifies a source table or database that does not exist, the utility returns an error and quits. No data is copied.

The source table data can change while the data is being copied. A lock is not acquired on the source table when data is copied.

The utility cannot copy a row with a width greater than 1GB (a PostgreSQL limitation).

If you copy a set of database tables with the --dbname, --include-table, or --include-table-file option, and the destination database does not exist, the utility creates the database before copying the tables. If the destination database exists, the utility creates the tables in the database if required.

The gpcopy utility does not copy dependent database objects unless you specify the --full option. For example, if a table has a default value on a column that is a user-defined function, that function must exist in the destination system database when using the --dbname, --include-table, or --include-table-file options.

When copying tables, sequences defined for tables are considered table data and are copied. A sequence is copied if a table is created with a serial column or if a sequence is specified as a default value. The sequences are reset if you specify the --truncate option.

The utility re-creates table indexes only with --full or --drop options.

The gpcopy utility does not copy configuration files such as postgresql.conf and pg_hba.conf. You must set up the destination system configuration separately.

The gpcopy utility does not copy external objects such as Greenplum Database extensions, third party jar files, and shared object files. You must install the external objects separately.

Specifying Table Names with Special Characters

When you list a table with the option --include-table or --exclude-table, and the table name or schema name contains single quote ('), double quote (") , or backslash (\), you must escape the character with a backslash (\). These are the escaped characters: \', \", \\. If the table or schema name contains a period (.), comma (,), or a space character, you must enclose the entire name in single quotes (') and enclose the table and schema in double quotes (") at the command prompt shell. In this example, the fully qualified name of the table is testdb.schema'1.table"test.
--include-table 'testdb."schema\'1"."table\"test"'
For fully qualified table names listed in a file that is used the option --include-table-file or --exclude-table-file, if the table name or schema name contains a period (.), the name must be enclosed in double quotes ("). In this example, the table table.test is in the testdb, and belongs to the schema schema.1.
testdb."schema.1"."table.test"

Copying Partitioned Tables

When copying data for a partitioned table, if a leaf partition has been exchanged with an external table, that leaf partition is created, but data is not copied.

If you specify copying leaf partitions of a partitioned table with an option such as --include-table or --exclude-table, gpcopy creates the partitioned table if it does not exist. Only the data for the specified leaf partitions are added to the partitioned table. Specifying individual leaf partitions is useful when the entire partitioned tabled does not need to be copied.

gpcopy does not support table data distribution checking when copying a partitioned table that is defined with a leaf table that is an external table or if a leaf table is defined with a distribution policy that is different from the root partitioned table. You can copy those tables in a gpcopy operation and specify the option --no-distribution-check to disable checking of data distribution.

Warning: Before you perform a gpcopy operation with the --no-distribution-check option, ensure that you have a backup of the destination database and that the distribution policies of the tables that are being copied are the same in the source and destination database. Copying data into segment instances with incorrect data distribution can cause incorrect query results and can cause database corruption.

Handling gpcopy Errors

When gpcopy encounters errors and quits or is cancelled by the user, current copy operations on tables in the destination database are rolled back. Copy operations that have completed are not rolled back.

If an error occurs during the process of copying a table, or table validation fails, gpcopy continues copying the other specified tables. After gpcopy finishes, it displays a list of tables where errors occurred or validation failed and displays a gpcopy command. You can use the provided command to retry copying the failed tables.

The gpcopy utility logs messages in log file gpcopy_date.log in the ~/gpAdminLogs directory on the master host. If you run multiple gpcopy commands on the same day, the utility appends messages to that day's log file.

After gpcopy completes, it displays a summary of the operations performed. If the utility fails to copy tables, they are highlighted in summary, and there is a gpcopy command provided in summary for user to just copy the failed tables. The information is displayed at the command prompt and in the gpcopy log file.

After resolving the issues that caused the copy operations to fail, you can run the provided command to copy the tables that failed in the previous gpcopy command.

Database Connections Created by gpcopy

The option --jobs produces 2*n+1 database connections. The default value 4, creates 9 connections.

If you increase this option, ensure that both source and destination Greenplum Database is configured with a sufficient maximum concurrent connection value to accommodate the gpcopy connections and other concurrent connections such as user connections. See the Greenplum Database server configuration parameter max_connections.

Examples

This command copies all user created databases in a source system to a destination system with the --full option. And drops the table and creates it again if it already exists in the destination.

gpcopy --source-host mytest --source-port 1234 --source-user gpuser \
    --dest-host demohost --dest-port 1234 --dest-user gpuser \
    --full --drop

This command copies the specified databases in a source system to a destination system with the --dbname option. The --truncate option truncates the table data before copying table data from the source table.

gpcopy --source-host mytest --source-port 1234 --source-user gpuser \
   --dest-host demohost --dest-port 1234 --dest-user gpuser \
   --dbname database1, database2 --truncate

This command copies the specified tables in a source system to a destination system with the --include-table option. The --skip-existing option skips the table if it already exists in the destination database.

gpcopy --source-host mytest --source-port 1234 --source-user gpuser \
   --dest-host demohost --dest-port 1234 --dest-user gpuser \
   --include-table database.schema.table1, database.schema.table2 --skip-existing

This command copies the tables from the source database to the destination system, excluding the tables specified in the specified tables in /home/gpuser/mytables with --exclude-table-file option. The --truncate option truncates tables that already exist in the destination system. With the options --analyze and --validate count, the utility performs an ANALYZE operation on the copied tables, and validates the copied table data by comparing row counts between source and destination tables.

gpcopy --source-host mytest --source-port 1234 --source-user gpuser \
   --dest-host demohost --dest-port 1234 --dest-user gpuser \
   --dbname database1 --exclude-table-file /home/gpuser/mytables \
   --truncate --analyze --validate count

This command specifies the --full and --metadata-only options to copy the complete database schema, including all tables, indexes, views, user-defined types (UDT), and user-defined functions (UDF) from all the source databases. No data is copied, The --drop option specifies that the table is dropped in the destination database before it is created again if the table exists in both the source and destination database.

gpcopy --source-host mytest --source-port 1234 --source-user gpuser \
   --dest-host demohost --dest-port 1234 --dest-user gpuser \
   --full --metadata-only --drop

This command copies the specified databases in a source system to a destination system with the --dbname option and specifies 8 parallel processes with the --jobs option. The command specifies the --truncate option to truncate the table and create it again if it already exists in the destination database.

gpcopy --source-host mytest --source-port 1234 --source-user gpuser \
   --dest-host demohost --dest-port 1234 --dest-user gpuser \
   --dbname database1, database2 --truncate --jobs 8

This command copies the specified database in a source system to a destination system with the --dbname option and specifies 16 parallel processes with the --jobs option. The --truncate option truncates the table and creates it again if it already exists in the destination database. The --truncate-source-after option truncates the tables in source database after that table data has been validated in the destination database.

gpcopy --source-host mytest --source-port 1234 --source-user gpuser \
   --dest-host demohost --dest-port 1234 --dest-user gpuser \
   --dbname database1 --truncate --jobs 16 --truncate-source-after --validate count

In the previous example, if --truncate was not specified and the destination table contained data, validation would fail.

This is an example table file that uses regular expressions.

"test1.arc/.*/./.*/"
"test1.c/(..)/y./.*/"

In the first line, the regular expressions for the schemas, arc/.*/, and for the tables, /.*/, limit the transfer to all tables with the schema names that start with arc.

In the second line, the regular expressions for the schemas, c/(..)/y, and for the tables, /.*/, limit the transfer to all tables with the schema names that are four characters long and that start with c and end with y, for example, crty.

When the command is run, tables in the database test1 that satisfy either condition are copied to the destination database.

See Also

For information about migrating data, see the Greenplum Database Administrator Guide.