Commit 93ece9cc authored by Tom Lane's avatar Tom Lane

Edit SGML documentation related to extended statistics.

Use the "statistics object" terminology uniformly here too.  Assorted
copy-editing.  Put new catalogs.sgml sections into alphabetical order.
parent e84c0195
This diff is collapsed.
This diff is collapsed.
...@@ -456,10 +456,11 @@ rows = (outer_cardinality * inner_cardinality) * selectivity ...@@ -456,10 +456,11 @@ rows = (outer_cardinality * inner_cardinality) * selectivity
</indexterm> </indexterm>
<sect2> <sect2>
<title>Functional dependencies</title> <title>Functional Dependencies</title>
<para> <para>
Multivariate correlation can be seen with a very simple data set &mdash; a Multivariate correlation can be demonstrated with a very simple data set
table with two columns, both containing the same values: &mdash; a table with two columns, both containing the same values:
<programlisting> <programlisting>
CREATE TABLE t (a INT, b INT); CREATE TABLE t (a INT, b INT);
...@@ -501,8 +502,8 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1; ...@@ -501,8 +502,8 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1;
number of rows, we see that the estimate is very accurate number of rows, we see that the estimate is very accurate
(in fact exact, as the table is very small). Changing the (in fact exact, as the table is very small). Changing the
<literal>WHERE</> to use the <structfield>b</> column, an identical <literal>WHERE</> to use the <structfield>b</> column, an identical
plan is generated. Observe what happens if we apply the same plan is generated. But observe what happens if we apply the same
condition on both columns combining them with <literal>AND</>: condition on both columns, combining them with <literal>AND</>:
<programlisting> <programlisting>
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1; EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
...@@ -514,15 +515,16 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1; ...@@ -514,15 +515,16 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
</programlisting> </programlisting>
The planner estimates the selectivity for each condition individually, The planner estimates the selectivity for each condition individually,
arriving to the 1% estimates as above, and then multiplies them, getting arriving at the same 1% estimates as above. Then it assumes that the
the final 0.01% estimate. The <quote>actual</quote> figures, however, conditions are independent, and so it multiplies their selectivities,
show that this results in a significant underestimate, as the actual producing a final selectivity estimate of just 0.01%.
number of rows matching the conditions (100) is two orders of magnitude This is a significant underestimate, as the actual number of rows
higher than the estimated value. matching the conditions (100) is two orders of magnitude higher.
</para> </para>
<para> <para>
This problem can be fixed by applying functional-dependency This problem can be fixed by creating a statistics object that
directs <command>ANALYZE</> to calculate functional-dependency
multivariate statistics on the two columns: multivariate statistics on the two columns:
<programlisting> <programlisting>
...@@ -539,13 +541,15 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1; ...@@ -539,13 +541,15 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
</sect2> </sect2>
<sect2> <sect2>
<title>Multivariate N-Distinct coefficients</title> <title>Multivariate N-Distinct Counts</title>
<para> <para>
A similar problem occurs with estimation of the cardinality of distinct A similar problem occurs with estimation of the cardinality of sets of
elements, used to determine the number of groups that would be generated multiple columns, such as the number of groups that would be generated by
by a <command>GROUP BY</command> clause. When <command>GROUP BY</command> a <command>GROUP BY</command> clause. When <command>GROUP BY</command>
lists a single column, the n-distinct estimate (which can be seen as the lists a single column, the n-distinct estimate (which is visible as the
number of rows returned by the aggregate execution node) is very accurate: estimated number of rows returned by the HashAggregate node) is very
accurate:
<programlisting> <programlisting>
EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a; EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a;
QUERY PLAN QUERY PLAN
...@@ -565,8 +569,8 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b; ...@@ -565,8 +569,8 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
Group Key: a, b Group Key: a, b
-&gt; Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1) -&gt; Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
</programlisting> </programlisting>
By dropping the existing statistics and re-creating it to include n-distinct By redefining the statistics object to include n-distinct counts for the
calculation, the estimate is much improved: two columns, the estimate is much improved:
<programlisting> <programlisting>
DROP STATISTICS stts; DROP STATISTICS stts;
CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t; CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
......
...@@ -79,11 +79,13 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na ...@@ -79,11 +79,13 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na
<term><replaceable class="PARAMETER">statistic_type</replaceable></term> <term><replaceable class="PARAMETER">statistic_type</replaceable></term>
<listitem> <listitem>
<para> <para>
A statistic type to be computed in this statistics object. Currently A statistic type to be computed in this statistics object.
supported types are <literal>ndistinct</literal>, which enables Currently supported types are
n-distinct coefficient tracking, <literal>ndistinct</literal>, which enables n-distinct statistics, and
and <literal>dependencies</literal>, which enables functional <literal>dependencies</literal>, which enables functional
dependencies. dependency statistics.
For more information, see <xref linkend="planner-stats-extended">
and <xref linkend="multivariate-statistics-examples">.
</para> </para>
</listitem> </listitem>
</varlistentry> </varlistentry>
...@@ -92,7 +94,8 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na ...@@ -92,7 +94,8 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na
<term><replaceable class="PARAMETER">column_name</replaceable></term> <term><replaceable class="PARAMETER">column_name</replaceable></term>
<listitem> <listitem>
<para> <para>
The name of a table column to be included in the statistics object. The name of a table column to be covered by the computed statistics.
At least two column names must be given.
</para> </para>
</listitem> </listitem>
</varlistentry> </varlistentry>
...@@ -114,7 +117,9 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na ...@@ -114,7 +117,9 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na
<title>Notes</title> <title>Notes</title>
<para> <para>
You must be the owner of a table to create or change statistics on it. You must be the owner of a table to create a statistics object
reading it. Once created, however, the ownership of the statistics
object is independent of the underlying table(s).
</para> </para>
</refsect1> </refsect1>
...@@ -124,8 +129,8 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na ...@@ -124,8 +129,8 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na
<para> <para>
Create table <structname>t1</> with two functionally dependent columns, i.e. Create table <structname>t1</> with two functionally dependent columns, i.e.
knowledge of a value in the first column is sufficient for determining the knowledge of a value in the first column is sufficient for determining the
value in the other column. Then functional dependencies are built on those value in the other column. Then functional dependency statistics are built
columns: on those columns:
<programlisting> <programlisting>
CREATE TABLE t1 ( CREATE TABLE t1 (
...@@ -136,21 +141,25 @@ CREATE TABLE t1 ( ...@@ -136,21 +141,25 @@ CREATE TABLE t1 (
INSERT INTO t1 SELECT i/100, i/500 INSERT INTO t1 SELECT i/100, i/500
FROM generate_series(1,1000000) s(i); FROM generate_series(1,1000000) s(i);
ANALYZE t1;
-- the number of matching rows will be drastically underestimated:
EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
CREATE STATISTICS s1 (dependencies) ON a, b FROM t1; CREATE STATISTICS s1 (dependencies) ON a, b FROM t1;
ANALYZE t1; ANALYZE t1;
-- valid combination of values -- now the rowcount estimate is more accurate:
EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0); EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
-- invalid combination of values
EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 1);
</programlisting> </programlisting>
Without functional-dependency statistics, the planner would make the Without functional-dependency statistics, the planner would assume
same estimate of the number of matching rows for these two queries. that the two <literal>WHERE</> conditions are independent, and would
With such statistics, it is able to tell that one case has matches multiply their selectivities together to arrive at a much-too-small
and the other does not. rowcount estimate.
With such statistics, the planner recognizes that the <literal>WHERE</>
conditions are redundant and does not underestimate the rowcount.
</para> </para>
</refsect1> </refsect1>
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment