Migrating Data with gpcopy

Migrating Data with gpcopy

This topic describes how to use the gpcopy utility to transfer data between databases in different Greenplum Database clusters.

gpcopy is a high-performance utility that can copy metadata and data from one Greenplum database to another Greenplum database. You can migrate the entire contents of a database, or just selected tables. The source and destination databases must be in different clusters, and the clusters can have different Greenplum Database versions. For example, you can use gpcopy to migrate data from a version 4.x Greenplum Database to version 5.x.

The gpcopy interface includes options to transfer one or more full databases, or one or more database tables. A full database transfer includes the database schema, table data, indexes, views, roles, user-defined functions, resource queues, and resource groups. If a copied table or database does not exist in the destination cluster, gpcopy creates it automatically, along with indexes as necessary.

Configuration files, including postgresql.conf and pg_hba.conf, must be transferred manually by an administrator. Extensions installed in the database with gppkg, such as MADlib and programming language extensions, must be installed in the destination database by an administrator.

gpcopy uses many of the same command-line options as the earlier gptransfer utility. gpcopy provides several improvements as compared to gptransfer:
  • gpcopy performs significantly faster than gptransfer when copying large amounts of data.
  • gpcopy provides more detailed reporting and summary information about all aspects of the copy operation.
  • gpcopy allows the source table data to change while the data is being copied. A lock is not acquired on the source table when data is copied.

  • The gpcopy utility includes the --truncate-source-after option to help migrate data from one Pivotal Greenplum Database system to another on the same hardware, requiring minimal free space available.

Prerequisites

The source and destination Greenplum Database systems must already exist, have network access between all hosts, and have master host and primary segment hosts in both systems. Each system must be configured with the same number of segments.

gpcopy is dependent on the pg_dump, pg_dumpall, and psql utilities installed with Greenplum Database. In most cases, you run gpcopy from a Greenplum Database cluster, so the dependencies are automatically met. If you need to run gpcopy on a remote server, such as an ETL system, copy the gpcopy binary and install a compatible Greenplum Clients package to meet the gpcopy dependencies.

gpcopy supports migrating data from a Greenplum Database 4.3.26 or later cluster to a Greenplum Database 5.9 or later cluster. However Greenplum Database 4.3.26 and later do not include the actual gpcopy utility. You must manually copy the gpcopy utility from your version 5.9 or later cluster into the older version cluster to migrate data.. For example:
$ cp /usr/local/greenplum-db-5.8.0/bin/gpcopy /usr/local/greenplum-db-4.3.26.0/bin/

Limitations for the Source and Destination Systems

gpcopy cannot copy data from one database to another in the same Greenplum Database system. The destination system must be a separate Greenplum Database cluster.

If you are copying data between Greenplum Database clusters having different versions, each cluster must have gpcopy installed locally. gpcopy is installed with Pivotal Greenplum Database starting with versions 5.9.0. For Greenplum Database 4.3.x, gpcopy is supported for versions 4.3.26.0 and later.

gpcopy transfers data from user databases only; the postgres, template0, and template1 databases cannot be transferred. Administrators must transfer configuration files manually and install extensions into the destination database with gppkg.

gpcopy cannot copy a row that is larger than 1GB in size.

gpcopy does not support table data distribution checking when copying a partitioned table that is defined with a leaf table that is an external table or if a leaf table is defined with a distribution policy that is different from the root partitioned table. You can copy those tables in a gpcopy operation and specify the --no-distribution-check option to disable checking of data distribution.

Warning: Before you perform a gpcopy operation with the --no-distribution-check option, ensure that you have a backup of the destination database and that the distribution policies of the tables that are being copied are the same in the source and destination database. Copying data into segment instances with incorrect data distribution can cause incorrect query results and can cause database corruption.

When transferring data between databases, you can run only one instance of gpcopy at a time. Running multiple, concurrent instances of gpcopy is not supported.

Configuring Parallel Jobs

The degree of parallelism when running gpcopy is determined the option --jobs. The option controls the number processes that gpcopy runs in parallel. The default is 4. The range is from 1 to 64.

The --jobs value, n, produces 2*n+1 database connections. For example, the default --jobs value of 4 creates 9 connections.

If you increase this option, ensure that the Greenplum Database systems are configured with a sufficient maximum concurrent connection value to accommodate the gpcopy connections and any other concurrent connections (such as user connections) that you require. See the Greenplum Database server configuration parameter max_connections.

Validating Copied Data

By default, gpcopy does not validate the data transferred. You can request validation using the --validate=type option. You must include --validate=type if you specify the --truncate-source-after option. The validation type can be one of the following:
  • count - compares the row counts between the source and destination tables.
  • md5xor - validates by selecting all rows of the source and destination tables, converting all columns in a row to text, and then calculating the md5 value of each row. gpcopy then performs an XOR over the MD5 values to ensure that all rows were successfully copied for the table.
Note: Avoid using --append with either validation option. If you use --append and the source table already includes rows, then either validation method will fail due to the different number of rows in the source and destination tables.

Addressing Failed Data Transfers

When gpcopy encounters errors and quits or is cancelled by the user, current copy operations on tables in the destination database are rolled back. Copy operations that have completed are not rolled back.

If an error occurs during the process of copying a table, or table validation fails, gpcopy continues copying the other specified tables. After gpcopy finishes, it displays a list of tables where errors occurred or validation failed and displays a gpcopy command. You can use the provided command to retry copying the failed tables.

The gpcopy utility logs messages in log file gpcopy_date.log in the ~/gpAdminLogs directory on the master host. If you run multiple gpcopy commands on the same day, the utility appends messages to that day's log file.

After gpcopy completes, it displays a summary of the operations performed. If the utility fails to copy tables, they are highlighted in summary, and there is a gpcopy command provided in summary for user to just copy the failed tables. The information is displayed at the command prompt and in the gpcopy log file.

After resolving the issues that caused the copy operations to fail, you can run the provided command to copy the tables that failed in the previous gpcopy command.

Performing a Basic Data Migration

Follow this procedure to migrate data from one Greenplum Database system to another with gpcopy:
  1. Start both the source and destination clusters.
  2. Perform a full database backup in the source Greenplum Database system. See Parallel Backup with gpbackup and gprestore.
  3. While Greenplum Database 4.3.26 and later support migration with gpcopy, Greenplum Database 4.3.x does not include the gpcopy utility. If you are migrating from a version 4.3.x system, Pivotal recommends that you copy the utility from your Greenplum Database 5.x installation to the 4.3.x installation. For example:
    $ cp /usr/local/greenplum-db-5.8.0/bin/gpcopy /usr/local/greenplum-db-4.3.26.0/bin/
  4. As a best practice, source the greenplum_path.sh file in the source Greenplum Database 4.3.x installation, so that you execute gpcopy from the source system. For example:
    $ source /usr/local/greenplum-db-4.3.26.0/greenplum_path.sh
  5. Use gpcopy with the --full option to migrate your data to the destination system. A full migration automatically copies all database objects including tables, es, views, roles, functions, user defined types (UDT), resource queues, and resource groups for all user defined databases. Include the --drop to drop any destination tables that may already exist (recreating es as necessary). For example:
    gpcopy --source-host mytest --source-port 1234 --source-user gpuser \
        --dest-host demohost --dest-port 1234 --dest-user gpuser \
        --full --drop --validate count
    With the above command, the utility drops tables in the destination database (--drop option) and uses the row count of the source and destination tables to validate the data transfer (--validate count option). The other gpcopy options specify the source and destination Greenplum Database system master hosts, ports, and the User ID to use to connect to the Greenplum Database systems.
    Note: While the example command performs a full system copy, consider migrating only portions of the your data at a time, so that you can reduce downtime while addressing table errors or validation failures that may occur during the copy operation.
  6. The gpcopy utility does not copy configuration files such as postgresql.conf and pg_hba.conf. You must set up the destination system configuration as necessary to match the source system.
  7. The gpcopy utility does not copy external objects such as Greenplum Database extensions, third party jar files, and shared object files. You must recreate these external objects as necessary to match the source system.
  8. Greenplum Database 5.x removes automatic implicit casts between the text type and other data types. After you migrate from Greenplum Database version 4.3.x to version 5.x, this change in behavior may impact existing applications and queries. Refer to About Implicit Text Casting in Greenplum Database in the Greenplum Database Installation Guide for information, including a discussion about supported and unsupported workarounds.
  9. After migrating data you may need to modify SQL scripts, administration scripts, and user-defined functions as necessary to account for changes in Greenplum Database version 5.x. Look for Upgrade Action Required entries in the Pivotal Greenplum 5.x Release Notes for features that may necessitate post-migration tasks.

See the gpcopy reference page for complete syntax and usage information.

Migrating Data Between Clusters that Share Hardware

In order to migrate data between two clusters on the same hardware, you should have enough free disk space to accommodate over 5 times the original data set. This enables you to maintain 2 full copies of the primary and mirror data sets (on the source and destination systems), plus the original backup data in ASCII format.

If you attempt to migrate on the same system but you run out of disk space, the gpcopy utility provides the --truncate-source-after option to help you complete the operation with only a minimum of free disk space. The --truncate-source-after option instructs the utility to truncate each source table after successfully copying the table data to the destination cluster and validating that the copy succeeded.

If you choose to use --truncate-source-after, consider the following:
  • Using the --truncate-source-after option does not allow for an easy rollback to the source system to its original condition if errors occur or validation checks fail during the gpcopy operation. Table errors or validation failures during the gpcopy operation can leave some tables remaining in the source cluster, while other tables may be empty (having been truncated after being copied to the new cluster). Back up all source data before using gpcopy with --truncate-source-after.
  • Migrating data with --truncate-source-after still requires an amount of free disk space equal to the sum of the largest tables that you will migrate in a single batch using gpcopy. For example, with a --jobs setting of 5, you must ensure that you have free space equal to the sum of the 5 largest tables copied in the batch. The procedure below provides sample commands to determine the largest table sizes.
  • You must use the --validate option with --truncate-source-after to ensure that data is successfully copied before source tables are truncated.
If you attempt to use the instructions in Performing a Basic Data Migration to migrate systems that use the same hardware, but you do not have the required free space:
  1. Start both the source and destination clusters.
  2. Perform a full database backup in the source Greenplum Database system. See Parallel Backup with gpbackup and gprestore.
  3. Determine if you have enough free space to migrate your data using --truncate-source-after. Migrating data "in-place" requires an amount of free disk space equal to the sum of the largest tables that you will migrate in a single batch using gpcopy. For example, if you want to use a --jobs setting of 5, ensure that you have free space equal to the sum of the 5 largest tables copied in the batch.
    The following query lists the largest 5 tables in your source system; modify the query as needed depending on the --jobs setting you intend to use:
    gpadmin=# SELECT n.nspname, c.relname, c.relstorage, pg_relation_size(c.oid)
    FROM 
    pg_class c JOIN pg_namespace n ON (c.relnamespace=n.oid)
    JOIN pg_catalog.gp_distribution_policy p ON (c.oid = p.localoid)
    WHERE
    n.nspname NOT IN ('gpexpand', 'pg_bitmap', 'information_schema', 'gp_toolkit')
    AND n.nspname NOT LIKE 'pg_temp_%%' AND c.relstorage <> 'v'
    ORDER BY 4 DESC LIMIT 5;
    

    Either free enough disk space to cover the sum of the table sizes shown in the above query, or consider using a smaller --jobs value to reduce the free space requirements.

  4. While Greenplum Database 4.3.26 and later support migration with gpcopy, Greenplum Database 4.3.x does not include the gpcopy utility. If you are migrating from a version 4.3.x system, Pivotal recommends that you copy the utility from your Greenplum Database 5.x installation to the 4.3.x installation. For example:
    $ cp /usr/local/greenplum-db-5.8.0/bin/gpcopy /usr/local/greenplum-db-4.3.26.0/bin/
  5. As a best practice, source the greenplum_path.sh file in the source Greenplum Database 4.3.x installation, so that you execute gpcopy from the source system. For example:
    $ source /usr/local/greenplum-db-4.3.26.0/greenplum_path.sh
  6. Use a gpcopy with the --truncate-source-after and --validate options to migrate your data to the destination system. A full migration automatically copies all database objects including tables, es, views, roles, functions, user defined types (UDT), resource queues, and resource groups for all user defined databases. Include the --drop to drop any destination tables that may already exist (recreating es as necessary). The --truncate-source-after truncates each source table, only after copying and validating the table data in the destination system. For example:
    gpcopy --source-host my_host --source-port 1234 --source-user gpuser \
    --dest-host my_host --dest-port 1235 --dest-user gpuser --full --drop \
    -truncate-source-after --analyze --validate count
    The above command performs a full database copy, first dropping tables in the destination database (--drop option) if they already exist. gpcopy truncates each table in the source system only after successfully copying and validating the table data in the destination system. The other gpcopy options specify the source and destination Greenplum Database system master hosts, ports, and the User ID to use to connect to the Greenplum Database systems.
    Note: While the example command performs a full system copy, consider migrating only portions of the your data at a time, so that you can reduce downtime while addressing table errors or validation failures that may occur during the copy operation.
  7. The gpcopy utility does not copy configuration files such as postgresql.conf and pg_hba.conf. You must set up the destination system configuration as necessary to match the source system.
  8. The gpcopy utility does not copy external objects such as Greenplum Database extensions, third party jar files, and shared object files. You must recreate these external objects as necessary to match the source system.
  9. Greenplum Database 5.x removes automatic implicit casts between the text type and other data types. After you migrate from Greenplum Database version 4.3.x to version 5.x, this change in behavior may impact existing applications and queries. Refer to About Implicit Text Casting in Greenplum Database in the Greenplum Database Installation Guide for information, including a discussion about supported and unsupported workarounds.
  10. After migrating data you may need to modify SQL scripts, administration scripts, and user-defined functions as necessary to account for changes in Greenplum Database version 5.x. Look for Upgrade Action Required entries in the Pivotal Greenplum 5.x Release Notes for features that may necessitate post-migration tasks.

See the gpcopy reference page for complete syntax and usage information.