Collecting Root Partition Statistics

Collecting Root Partition Statistics

For a partitioned table, GPORCA uses statistics of the table root partition to generate query plans. These statistics are used for determining the join order, for splitting and joining aggregate nodes, and for costing the query steps. This is in contrast to the legacy planner, which uses the statistics of each leaf partition. If you execute queries on partitioned tables, you must collect statistics on the root partition and periodically update those statistics to ensure that GPORCA can generate optimal query plans.

Setting the optimizer_analyze_root_partition Parameter

A primary way to ensure that you collect root partition statistics is to set the optimizer_analyze_root_partition configuration parameter to on. When the parameter is enabled, Greenplum Database collects root partition statistics (ANALYZE ROOTPARTITION operation) any time you run an ANALYZE command against a partitioned table.

  1. Log into the Greenplum Database master host as gpadmin, the Greenplum Database administrator.
  2. Set the values of the server configuration parameters. These Greenplum Database gpconfig utility commands sets the value of the parameters to on:
    $ gpconfig -c optimizer_analyze_root_partition -v on --masteronly
  3. Restart Greenplum Database. This Greenplum Database gpstop utility command reloads the postgresql.conf files of the master and segments without shutting down Greenplum Database.
    gpstop -u

Running ANALYZE ROOTPARTITION

The ANALYZE ROOTPARTITION command samples the leaf partition data in a table, and stores the statistics in the root partition. The leaf partition data is only sampled, but statistics are not collected on the leaf partitions. Keep in mind that ANALYZE ROOTPARTITION always scans the entire table before updating the root partition statistics. If your table is very large, this operation can take a significant amount of time. ANALYZE ROOTPARTITION also uses an ACCESS SHARE lock that prevents certain operations, such as TRUNCATE and VACUUM operations, during execution. For these reason, you should schedule ANALYZE ROOTPARTITION operations periodically, or when there are significant changes to leaf partition data.

Keep in mind that even if the root partition statistics are not up-to-date or do not exist, GPORCA will still perform dynamic partition elimination for queries against the table.

Follow these best practices for running ANALYZE ROOTPARTITION in your system:
  • Always run ANALYZE ROOTPARTITION at least once for each newly-created partitioned table, after adding initial data. If you execute a query on a root partition that has no statistics but has tuples, you receive the notice:
    NOTICE: One or more columns in the following table(s) do not have statistics: <table_name>
    HINT: For non-partitioned tables, run analyze <table_name>(<column_list>). For partitioned tables, 
    run analyze rootpartition <table_name>(<column_list>). See log for columns missing statistics.
  • Update root partition statistics when you observe query performance regression in EXPLAIN plans against the table, or after significant changes to leaf partition data. For example, if you add a new leaf partition at some point after generating root partition statistics, consider running ANALYZE ROOTPARTITION to update root partition statistics with the new tuples inserted from the new leaf partition.
  • For very large tables, run ANALYZE ROOTPARTITION only weekly, or at some interval longer than daily.
  • Avoid running ANALYZE with no arguments, because doing so executes the command on all database tables including partitioned tables. With large databases, these global ANALYZE operations are difficult to monitor, and it can be difficult to predict the time needed for completion.
  • Consider running multiple ANALYZE ROOTPARTITION <table_name> operations in parallel to speed the operation of statistics collection, if your I/O throughput can support the load.
  • You can also use the Greenplum Database utility analyzedb to update table statistics. Using analyzedb ensures that tables that were previously analzyed are not re-analyzed if no modifications were made to the leaf partition.

GPORCA and Leaf Partition Statistics

Although creating and maintaining root partition statistics is crucial for GPORCA query performance with partitioned tables, maintaining leaf partition statistics is also important. If GPORCA cannot generate a plan for a query against a partitioned table, then the legacy planner is used and leaf partition statistics are needed to produce the optimal plan for that query.

GPORCA itself also uses leaf partition statistics for any queries that access leaf partitions directly, instead of using the root partition with predicates to eliminate partitions. For example, if you know which partitions hold necessary tuples for a query, you can directly query the leaf partition table itself; in this case GPORCA uses the leaf partition statistics.