Collecting Root Partition Statistics
For a partitioned table, GPORCA uses statistics of the table root partition to generate query plans. These statistics are used for determining the join order, for splitting and joining aggregate nodes, and for costing the query steps. In contrast, the Postgres Planner uses the statistics of each leaf partition.
If you run queries on partitioned tables, you should collect statistics on the root partition and periodically update those statistics to ensure that GPORCA can generate optimal query plans. If the root partition statistics are not up-to-date or do not exist, GPORCA still performs dynamic partition elimination for queries against the table. However, the query plan might not be optimal.
Parent topic:About GPORCA
By default, running the
ANALYZE command on the root partition of a partitioned table samples the leaf partition data in the table, and stores the statistics for the root partition.
ANALYZE collects statistics on the root and leaf partitions, including HyperLogLog (HLL) statistics on the leaf partitions.
ANALYZE ROOTPARTITION collects statistics only on the root partition. The server configuration parameter optimizer_analyze_root_partition controls whether the
ROOTPARTITION keyword is required to collect root statistics for the root partition of a partitioned table. See the ANALYZE command for information about collecting statistics on partitioned tables.
Keep in mind that
ANALYZE always scans the entire table before updating the root partition statistics. If your table is very large, this operation can take a significant amount of time.
ANALYZE ROOTPARTITION also uses an
ACCESS SHARE lock that prevents certain operations, such as
VACUUM operations, during runtime. For these reasons, you should schedule
ANALYZE operations periodically, or when there are significant changes to leaf partition data.
Follow these best practices for running
ANALYZE ROOTPARTITION on partitioned tables in your system:
ANALYZE <root_partition_table_name>on a new partitioned table after adding initial data. Run
ANALYZE <leaf_partition_table_name>on a new leaf partition or a leaf partition where data has changed. By default, running the command on a leaf partition updates the root partition statistics if the other leaf partitions have statistics.
- Update root partition statistics when you observe query performance regression in
EXPLAINplans against the table, or after significant changes to leaf partition data. For example, if you add a new leaf partition at some point after generating root partition statistics, consider running
ANALYZE ROOTPARTITIONto update root partition statistics with the new tuples inserted from the new leaf partition.
- For very large tables, run
ANALYZE ROOTPARTITIONonly weekly, or at some interval longer than daily.
- Avoid running
ANALYZEwith no arguments, because doing so runs the command on all database tables including partitioned tables. With large databases, these global
ANALYZEoperations are difficult to monitor, and it can be difficult to predict the time needed for completion.
- Consider running multiple
ANALYZE ROOTPARTITION <table_name>operations in parallel to speed the operation of statistics collection, if your I/O throughput can support the load.
- You can also use the Greenplum Database utility
analyzedbto update table statistics. Using
analyzedbensures that tables that were previously analyzed are not re-analyzed if no modifications were made to the leaf partition.
Although creating and maintaining root partition statistics is crucial for GPORCA query performance with partitioned tables, maintaining leaf partition statistics is also important. If GPORCA cannot generate a plan for a query against a partitioned table, then the Postgres Planner is used and leaf partition statistics are needed to produce the optimal plan for that query.
GPORCA itself also uses leaf partition statistics for any queries that access leaf partitions directly, instead of using the root partition with predicates to eliminate partitions. For example, if you know which partitions hold necessary tuples for a query, you can directly query the leaf partition table itself; in this case GPORCA uses the leaf partition statistics.
If you do not intend to run queries on partitioned tables with GPORCA (setting the server configuration parameter optimizer to
off), then you can disable the automatic collection of statistics on the root partition of the partitioned table. The server configuration parameter optimizer_analyze_root_partition controls whether the
ROOTPARTITION keyword is required to collect root statistics for the root partition of a partitioned table. The default setting for the parameter is
ANALYZE command can collect root partition statistics without the
ROOTPARTITION keyword. You can disable automatic collection of root partition statistics by setting the parameter to
off. When the value is
off, you must run
ANALZYE ROOTPARTITION to collect root partition statistics.
- Log into the Greenplum Database master host as
gpadmin, the Greenplum Database administrator.
Set the values of the server configuration parameters. These Greenplum Database
gpconfigutility commands sets the value of the parameters to
$ gpconfig -c optimizer_analyze_root_partition -v off --masteronly
Restart Greenplum Database. This Greenplum Database
gpstoputility command reloads the
postgresql.conffiles of the master and segments without shutting down Greenplum Database.