Collecting Root Partition Statistics

Collecting Root Partition Statistics

For a partitioned table, GPORCA uses statistics of the table root partition to generate query plans. These statistics are used for determining the join order, for splitting and joining aggregate nodes, and for costing the query steps. In contrast, the legacy planner uses the statistics of each leaf partition.

If you execute queries on partitioned tables, you should collect statistics on the root partition and periodically update those statistics to ensure that GPORCA can generate optimal query plans. If the root partition statistics are not up-to-date or do not exist, GPORCA still performs dynamic partition elimination for queries against the table. However, the query plan might not be optimal.

Running ANALYZE

By default, running the ANALYZE command on the root partition of a partitioned table samples the leaf partition data in the table, and stores the statistics for the root partition. ANALYZE collects statistics on the root and leaf partitions, including HyperLogLog (HLL) statistics on the leaf partitions. ANALYZE ROOTPARTITION collects statistics only on the root partition. The server configuration parameter optimizer_analyze_root_partition controls whether the ROOTPARTITION keyword is required to collect root statistics for the root partition of a partitioned table. See the ANALYZE command for information about collecting statistics on partitioned tables.

Keep in mind that ANALYZE always scans the entire table before updating the root partition statistics. If your table is very large, this operation can take a significant amount of time. ANALYZE ROOTPARTITION also uses an ACCESS SHARE lock that prevents certain operations, such as TRUNCATE and VACUUM operations, during execution. For these reasons, you should schedule ANALYZE operations periodically, or when there are significant changes to leaf partition data.

Follow these best practices for running ANALYZE or ANALYZE ROOTPARTITION on partitioned tables in your system:
  • Run ANALYZE <root_partition> on a new partitioned table after adding initial data. Run ANALYZE <leaf_partition> on a new leaf partition or a leaf partition where data has changed. By default, running the command on a leaf partition updates the root partition statistics if the other leaf partitions have statistics.
  • Update root partition statistics when you observe query performance regression in EXPLAIN plans against the table, or after significant changes to leaf partition data. For example, if you add a new leaf partition at some point after generating root partition statistics, consider running ANALYZE or ANALYZE ROOTPARTITION to update root partition statistics with the new tuples inserted from the new leaf partition.
  • For very large tables, run ANALYZE or ANALYZE ROOTPARTITION only weekly, or at some interval longer than daily.
  • Avoid running ANALYZE with no arguments, because doing so executes the command on all database tables including partitioned tables. With large databases, these global ANALYZE operations are difficult to monitor, and it can be difficult to predict the time needed for completion.
  • Consider running multiple ANALYZE <table_name> or ANALYZE ROOTPARTITION <table_name> operations in parallel to speed the operation of statistics collection, if your I/O throughput can support the load.
  • You can also use the Greenplum Database utility analyzedb to update table statistics. Using analyzedb ensures that tables that were previously analzyed are not re-analyzed if no modifications were made to the leaf partition.

GPORCA and Leaf Partition Statistics

Although creating and maintaining root partition statistics is crucial for GPORCA query performance with partitioned tables, maintaining leaf partition statistics is also important. If GPORCA cannot generate a plan for a query against a partitioned table, then the legacy planner is used and leaf partition statistics are needed to produce the optimal plan for that query.

GPORCA itself also uses leaf partition statistics for any queries that access leaf partitions directly, instead of using the root partition with predicates to eliminate partitions. For example, if you know which partitions hold necessary tuples for a query, you can directly query the leaf partition table itself; in this case GPORCA uses the leaf partition statistics.

Disabling Automatic Root Partition Statistics Collection

If you do not intend to execute queries on partitioned tables with GPORCA (setting the server configuration parameter optimizer to off), then you can disable the automatic collection of statistics on the root partition of the partitioned table. The server configuration parameter optimizer_analyze_root_partition controls whether the ROOTPARTITION keyword is required to collect root statistics for the root partition of a partitioned table. The default setting for the parameter is on, the ANALYZE command can collect root partition statistics without the ROOTPARTITION keyword. You can disable automatic collection of root partition statistics by setting the parameter to off. When the value is off, you must run ANALZYE ROOTPARTITION to collect root partition statistics.

  1. Log into the Greenplum Database master host as gpadmin, the Greenplum Database administrator.
  2. Set the values of the server configuration parameters. These Greenplum Database gpconfig utility commands sets the value of the parameters to off:
    $ gpconfig -c optimizer_analyze_root_partition -v off --masteronly
  3. Restart Greenplum Database. This Greenplum Database gpstop utility command reloads the postgresql.conf files of the master and segments without shutting down Greenplum Database.
    gpstop -u