Updating Statistics with ANALYZE
A newer version of this documentation is available. Click here to view the most up-to-date release of the Greenplum 5.x documentation.
Updating Statistics with ANALYZE
The most important prerequisite for good query performance is to begin with accurate statistics for the tables. Updating stastistics with the ANALYZE statement enables the query planner to generate optimal query plans. When a table is analyzed, information about the data is stored in the system catalog tables. If the stored information is out of date, the planner can generate inefficient plans.
Generating Statistics Selectively
Running ANALYZE with no arguments updates statistics for all tables in the database. This can be a very long-running process and it is not recommended. You should ANALYZE tables selectively when data has changed or use the analyzedb utility.
Running ANALYZE on a large table can take a long time. If it is not feasible to run ANALYZE on all columns of a very large table, you can generate statistics for selected columns only using ANALYZE table(column, ...). Be sure to include columns used in joins, WHERE clauses, SORT clauses, GROUP BY clauses, or HAVING clauses.
SELECT partitiontablename from pg_partitions WHERE tablename='parent_table;
Improving Statistics Quality
There is a trade-off between the amount of time it takes to generate statistics and the quality, or accuracy, of the statistics.
To allow large tables to be analyzed in a reasonable amount of time, ANALYZE takes a random sample of the table contents, rather than examining every row. To increase the number of sample values for all table columns adjust the default_statistics_target configuration parameter. The target value ranges from 1 to 1000; the default target value is 100. The default_statistics_target variable applies to all columns by default, and specifies the number of values that are stored in the list of common values. A larger target may improve the quality of the query planner’s estimates, especially for columns with irregular data patterns. default_statistics_target can be set at the master/session level and requires a reload.
The gp_analyze_relative_error configuration parameter affects the sampling rate during statistics collection to determine cardinality in a column. For example, a value of .5 is equivalent to an acceptable error of 50%. The default is .25. Use the gp_analyze_relative_error parameter to set the acceptable estimated relative error in the cardinality of a table. If statistics do not produce good estimates of cardinality for a particular table attribute, decreasing the relative error fraction (accepting less errors) tells the system to sample more rows. However, it is not recommended to reduce this below 0.1 as it will increase ANALYZE time substantially.
When to Run ANALYZE
- after loading data,
- after CREATE INDEX operations,
- and after INSERT, UPDATE, and DELETE operations that significantly change the underlying data.
Configuring Automatic Statistics Collection
The gp_autostats_mode configuration parameter, together with the gp_autostats_on_change_threshold parameter, determines when an automatic analyze operation is triggered. When automatic statistics collection is triggered, the planner adds an ANALYZE step to the query.
By default, gp_autostats_mode is on_no_stats, which triggers statistics collection for CREATE TABLE AS SELECT, INSERT, or COPY operations on any table that has no existing statistics.
Setting gp_autostats_mode to on_change triggers statistics collection only when the number of rows affected exceeds the threshold defined by gp_autostats_on_change_threshold, which has a default value of 2147483647. Operations that can trigger automatic statistics collection with on_change are: CREATE TABLE AS SELECT, UPDATE, DELETE, INSERT, and COPY.
Setting gp_autostats_mode to none disables automatics statistics collection.
For partitioned tables, automatic statistics collection is not triggered if data is inserted from the top-level parent table of a partitioned table. But automatic statistics collection is triggered if data is inserted directly in a leaf table (where the data is stored) of the partitioned table.