Commit 93ece9cc authored by Tom Lane

Edit SGML documentation related to extended statistics.

Use the "statistics object" terminology uniformly here too.  Assorted
copy-editing.  Put new catalogs.sgml sections into alphabetical order.
parent e84c0195
@@ -456,10 +456,11 @@ rows = (outer_cardinality * inner_cardinality) * selectivity
 </indexterm>
 <sect2>
-<title>Functional dependencies</title>
+<title>Functional Dependencies</title>
 <para>
-Multivariate correlation can be seen with a very simple data set &mdash; a
-table with two columns, both containing the same values:
+Multivariate correlation can be demonstrated with a very simple data set
+&mdash; a table with two columns, both containing the same values:
 <programlisting>
 CREATE TABLE t (a INT, b INT);
@@ -501,8 +502,8 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1;
 number of rows, we see that the estimate is very accurate
 (in fact exact, as the table is very small). Changing the
 <literal>WHERE</> to use the <structfield>b</> column, an identical
-plan is generated. Observe what happens if we apply the same
-condition on both columns combining them with <literal>AND</>:
+plan is generated. But observe what happens if we apply the same
+condition on both columns, combining them with <literal>AND</>:
 <programlisting>
 EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
@@ -514,15 +515,16 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
 </programlisting>
 The planner estimates the selectivity for each condition individually,
-arriving to the 1% estimates as above, and then multiplies them, getting
-the final 0.01% estimate. The <quote>actual</quote> figures, however,
-show that this results in a significant underestimate, as the actual
-number of rows matching the conditions (100) is two orders of magnitude
-higher than the estimated value.
+arriving at the same 1% estimates as above. Then it assumes that the
+conditions are independent, and so it multiplies their selectivities,
+producing a final selectivity estimate of just 0.01%.
+This is a significant underestimate, as the actual number of rows
+matching the conditions (100) is two orders of magnitude higher.
 </para>
 <para>
-This problem can be fixed by applying functional-dependency
+This problem can be fixed by creating a statistics object that
+directs <command>ANALYZE</> to calculate functional-dependency
 multivariate statistics on the two columns:
 <programlisting>
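As a back-of-envelope check of the independence arithmetic described in the hunk above: with the example's 10,000-row table and a 1% selectivity for each clause (figures taken from the quoted plans), multiplying the per-clause selectivities yields an estimate of about 1 row, while the functional dependency means 100 rows actually match. A minimal sketch in plain Python (not part of the commit):

```python
# Planner-style arithmetic for WHERE a = 1 AND b = 1, using the
# figures from the example: 10,000 rows, 1% selectivity per clause.
table_rows = 10_000
sel_a = 0.01   # fraction of rows with a = 1
sel_b = 0.01   # fraction of rows with b = 1

# Independence assumption: combined selectivity is the product.
independent_estimate = table_rows * sel_a * sel_b   # about 1 row

# With b functionally dependent on a (here b always equals a),
# the second clause filters out nothing further.
dependent_estimate = table_rows * sel_a             # 100 rows

print(independent_estimate, dependent_estimate)
```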
@@ -539,13 +541,15 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
 </sect2>
 <sect2>
-<title>Multivariate N-Distinct coefficients</title>
+<title>Multivariate N-Distinct Counts</title>
 <para>
-A similar problem occurs with estimation of the cardinality of distinct
-elements, used to determine the number of groups that would be generated
-by a <command>GROUP BY</command> clause. When <command>GROUP BY</command>
-lists a single column, the n-distinct estimate (which can be seen as the
-number of rows returned by the aggregate execution node) is very accurate:
+A similar problem occurs with estimation of the cardinality of sets of
+multiple columns, such as the number of groups that would be generated by
+a <command>GROUP BY</command> clause. When <command>GROUP BY</command>
+lists a single column, the n-distinct estimate (which is visible as the
+estimated number of rows returned by the HashAggregate node) is very
+accurate:
 <programlisting>
 EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a;
 QUERY PLAN
@@ -565,8 +569,8 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
 Group Key: a, b
 -&gt; Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
 </programlisting>
-By dropping the existing statistics and re-creating it to include n-distinct
-calculation, the estimate is much improved:
+By redefining the statistics object to include n-distinct counts for the
+two columns, the estimate is much improved:
 <programlisting>
 DROP STATISTICS stts;
 CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
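A similar sketch for the n-distinct case (again plain Python, not part of the commit; it assumes for illustration that both columns hold i % 100 over 10,000 rows, which is consistent with the row counts visible in the quoted plans):

```python
# Each column has 100 distinct values, but because the columns are
# identical, the pair (a, b) also has only 100 distinct combinations.
rows = [(i % 100, i % 100) for i in range(1, 10_001)]

nd_a = len({a for a, _ in rows})   # 100 distinct values in a
nd_b = len({b for _, b in rows})   # 100 distinct values in b
nd_pairs = len(set(rows))          # 100, not 100 * 100

# Treating the columns as independent suggests nd_a * nd_b groups,
# clamped to the table size -- a large overestimate of the number of
# groups produced by GROUP BY a, b.
naive_groups = min(nd_a * nd_b, len(rows))

print(nd_pairs, naive_groups)
```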
......
@@ -79,11 +79,13 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na
 <term><replaceable class="PARAMETER">statistic_type</replaceable></term>
 <listitem>
 <para>
-A statistic type to be computed in this statistics object. Currently
-supported types are <literal>ndistinct</literal>, which enables
-n-distinct coefficient tracking,
-and <literal>dependencies</literal>, which enables functional
-dependencies.
+A statistic type to be computed in this statistics object.
+Currently supported types are
+<literal>ndistinct</literal>, which enables n-distinct statistics, and
+<literal>dependencies</literal>, which enables functional
+dependency statistics.
+For more information, see <xref linkend="planner-stats-extended">
+and <xref linkend="multivariate-statistics-examples">.
 </para>
 </listitem>
 </varlistentry>
@@ -92,7 +94,8 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na
 <term><replaceable class="PARAMETER">column_name</replaceable></term>
 <listitem>
 <para>
-The name of a table column to be included in the statistics object.
+The name of a table column to be covered by the computed statistics.
+At least two column names must be given.
 </para>
 </listitem>
 </varlistentry>
@@ -114,7 +117,9 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na
 <title>Notes</title>
 <para>
-You must be the owner of a table to create or change statistics on it.
+You must be the owner of a table to create a statistics object
+reading it. Once created, however, the ownership of the statistics
+object is independent of the underlying table(s).
 </para>
 </refsect1>
@@ -124,8 +129,8 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na
 <para>
 Create table <structname>t1</> with two functionally dependent columns, i.e.
 knowledge of a value in the first column is sufficient for determining the
-value in the other column. Then functional dependencies are built on those
-columns:
+value in the other column. Then functional dependency statistics are built
+on those columns:
 <programlisting>
 CREATE TABLE t1 (
@@ -136,21 +141,25 @@ CREATE TABLE t1 (
 INSERT INTO t1 SELECT i/100, i/500
 FROM generate_series(1,1000000) s(i);
 ANALYZE t1;
+-- the number of matching rows will be drastically underestimated:
 EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
 CREATE STATISTICS s1 (dependencies) ON a, b FROM t1;
 ANALYZE t1;
--- valid combination of values
+-- now the rowcount estimate is more accurate:
 EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
--- invalid combination of values
-EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 1);
 </programlisting>
-Without functional-dependency statistics, the planner would make the
-same estimate of the number of matching rows for these two queries.
-With such statistics, it is able to tell that one case has matches
-and the other does not.
+Without functional-dependency statistics, the planner would assume
+that the two <literal>WHERE</> conditions are independent, and would
+multiply their selectivities together to arrive at a much-too-small
+rowcount estimate.
+With such statistics, the planner recognizes that the <literal>WHERE</>
+conditions are redundant and does not underestimate the rowcount.
 </para>
 </refsect1>
......
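To see concretely why the example's two WHERE clauses are redundant, the t1 data generation can be replayed outside the database. A small Python sketch (not part of the commit) mirroring the INSERT above:

```python
# Mirror of: INSERT INTO t1 SELECT i/100, i/500
#            FROM generate_series(1,1000000) s(i);
# (// is integer division, matching SQL's int / int)
N = 1_000_000
t1 = [(i // 100, i // 500) for i in range(1, N + 1)]

# b is fully determined by a: a = 1 means i in 100..199, hence b = 0.
actual = sum(1 for a, b in t1 if a == 1 and b == 0)   # 100 rows

# Independence assumption: sel(a = 1) * sel(b = 0) * N.
sel_a = sum(1 for a, _ in t1 if a == 1) / N           # 100 / N
sel_b = sum(1 for _, b in t1 if b == 0) / N           # 499 / N
naive = sel_a * sel_b * N                             # well under 1 row

print(actual, naive)
```

The naive product predicts less than one matching row, which is why the diff's comment says the row count "will be drastically underestimated" before the statistics object exists.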