Commit 93ece9cc authored by Tom Lane

Edit SGML documentation related to extended statistics.

Use the "statistics object" terminology uniformly here too.  Assorted
copy-editing.  Put new catalogs.sgml sections into alphabetical order.
parent e84c0195
@@ -456,10 +456,11 @@ rows = (outer_cardinality * inner_cardinality) * selectivity
 </indexterm>
 <sect2>
-<title>Functional dependencies</title>
+<title>Functional Dependencies</title>
 <para>
-Multivariate correlation can be seen with a very simple data set &mdash; a
-table with two columns, both containing the same values:
+Multivariate correlation can be demonstrated with a very simple data set
+&mdash; a table with two columns, both containing the same values:
 <programlisting>
 CREATE TABLE t (a INT, b INT);
@@ -501,8 +502,8 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1;
 number of rows, we see that the estimate is very accurate
 (in fact exact, as the table is very small). Changing the
 <literal>WHERE</> to use the <structfield>b</> column, an identical
-plan is generated. Observe what happens if we apply the same
-condition on both columns combining them with <literal>AND</>:
+plan is generated. But observe what happens if we apply the same
+condition on both columns, combining them with <literal>AND</>:
 <programlisting>
 EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
@@ -514,15 +515,16 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
 </programlisting>
 The planner estimates the selectivity for each condition individually,
-arriving to the 1% estimates as above, and then multiplies them, getting
-the final 0.01% estimate. The <quote>actual</quote> figures, however,
-show that this results in a significant underestimate, as the actual
-number of rows matching the conditions (100) is two orders of magnitude
-higher than the estimated value.
+arriving at the same 1% estimates as above. Then it assumes that the
+conditions are independent, and so it multiplies their selectivities,
+producing a final selectivity estimate of just 0.01%.
+This is a significant underestimate, as the actual number of rows
+matching the conditions (100) is two orders of magnitude higher.
 </para>
 <para>
-This problem can be fixed by applying functional-dependency
+This problem can be fixed by creating a statistics object that
+directs <command>ANALYZE</> to calculate functional-dependency
 multivariate statistics on the two columns:
 <programlisting>
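As a back-of-envelope check of the independence arithmetic described in the hunk above: with the example's 10,000-row table and a 1% selectivity for each clause (figures taken from the quoted plans), multiplying the per-clause selectivities yields an estimate of about 1 row, while the functional dependency means 100 rows actually match. A minimal sketch in plain Python (not part of the commit):

```python
# Planner-style arithmetic for WHERE a = 1 AND b = 1, using the
# figures from the example: 10,000 rows, 1% selectivity per clause.
table_rows = 10_000
sel_a = 0.01   # fraction of rows with a = 1
sel_b = 0.01   # fraction of rows with b = 1

# Independence assumption: combined selectivity is the product.
independent_estimate = table_rows * sel_a * sel_b   # about 1 row

# With b functionally dependent on a (here b always equals a),
# the second clause filters out nothing further.
dependent_estimate = table_rows * sel_a             # 100 rows

print(independent_estimate, dependent_estimate)
```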
@@ -539,13 +541,15 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
 </sect2>
 <sect2>
-<title>Multivariate N-Distinct coefficients</title>
+<title>Multivariate N-Distinct Counts</title>
 <para>
-A similar problem occurs with estimation of the cardinality of distinct
-elements, used to determine the number of groups that would be generated
-by a <command>GROUP BY</command> clause. When <command>GROUP BY</command>
-lists a single column, the n-distinct estimate (which can be seen as the
-number of rows returned by the aggregate execution node) is very accurate:
+A similar problem occurs with estimation of the cardinality of sets of
+multiple columns, such as the number of groups that would be generated by
+a <command>GROUP BY</command> clause. When <command>GROUP BY</command>
+lists a single column, the n-distinct estimate (which is visible as the
+estimated number of rows returned by the HashAggregate node) is very
+accurate:
 <programlisting>
 EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a;
 QUERY PLAN
@@ -565,8 +569,8 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
 Group Key: a, b
 -&gt; Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
 </programlisting>
-By dropping the existing statistics and re-creating it to include n-distinct
-calculation, the estimate is much improved:
+By redefining the statistics object to include n-distinct counts for the
+two columns, the estimate is much improved:
 <programlisting>
 DROP STATISTICS stts;
 CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
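A similar sketch for the n-distinct case (again plain Python, not part of the commit; it assumes for illustration that both columns hold i % 100 over 10,000 rows, which is consistent with the row counts visible in the quoted plans):

```python
# Each column has 100 distinct values, but because the columns are
# identical, the pair (a, b) also has only 100 distinct combinations.
rows = [(i % 100, i % 100) for i in range(1, 10_001)]

nd_a = len({a for a, _ in rows})   # 100 distinct values in a
nd_b = len({b for _, b in rows})   # 100 distinct values in b
nd_pairs = len(set(rows))          # 100, not 100 * 100

# Treating the columns as independent suggests nd_a * nd_b groups,
# clamped to the table size -- a large overestimate of the number of
# groups produced by GROUP BY a, b.
naive_groups = min(nd_a * nd_b, len(rows))

print(nd_pairs, naive_groups)
```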
......
@@ -79,11 +79,13 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na
 <term><replaceable class="PARAMETER">statistic_type</replaceable></term>
 <listitem>
 <para>
-A statistic type to be computed in this statistics object. Currently
-supported types are <literal>ndistinct</literal>, which enables
-n-distinct coefficient tracking,
-and <literal>dependencies</literal>, which enables functional
-dependencies.
+A statistic type to be computed in this statistics object.
+Currently supported types are
+<literal>ndistinct</literal>, which enables n-distinct statistics, and
+<literal>dependencies</literal>, which enables functional
+dependency statistics.
+For more information, see <xref linkend="planner-stats-extended">
+and <xref linkend="multivariate-statistics-examples">.
 </para>
 </listitem>
 </varlistentry>
@@ -92,7 +94,8 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na
 <term><replaceable class="PARAMETER">column_name</replaceable></term>
 <listitem>
 <para>
-The name of a table column to be included in the statistics object.
+The name of a table column to be covered by the computed statistics.
+At least two column names must be given.
 </para>
 </listitem>
 </varlistentry>
@@ -114,7 +117,9 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na
 <title>Notes</title>
 <para>
-You must be the owner of a table to create or change statistics on it.
+You must be the owner of a table to create a statistics object
+reading it. Once created, however, the ownership of the statistics
+object is independent of the underlying table(s).
 </para>
 </refsect1>
@@ -124,8 +129,8 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na
 <para>
 Create table <structname>t1</> with two functionally dependent columns, i.e.
 knowledge of a value in the first column is sufficient for determining the
-value in the other column. Then functional dependencies are built on those
-columns:
+value in the other column. Then functional dependency statistics are built
+on those columns:
 <programlisting>
 CREATE TABLE t1 (
@@ -136,21 +141,25 @@ CREATE TABLE t1 (
 INSERT INTO t1 SELECT i/100, i/500
 FROM generate_series(1,1000000) s(i);
 ANALYZE t1;
+-- the number of matching rows will be drastically underestimated:
 EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
 CREATE STATISTICS s1 (dependencies) ON a, b FROM t1;
 ANALYZE t1;
--- valid combination of values
+-- now the rowcount estimate is more accurate:
 EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
--- invalid combination of values
-EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 1);
 </programlisting>
-Without functional-dependency statistics, the planner would make the
-same estimate of the number of matching rows for these two queries.
-With such statistics, it is able to tell that one case has matches
-and the other does not.
+Without functional-dependency statistics, the planner would assume
+that the two <literal>WHERE</> conditions are independent, and would
+multiply their selectivities together to arrive at a much-too-small
+rowcount estimate.
+With such statistics, the planner recognizes that the <literal>WHERE</>
+conditions are redundant and does not underestimate the rowcount.
 </para>
 </refsect1>
......
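To see concretely why the example's two WHERE clauses are redundant, the t1 data generation can be replayed outside the database. A small Python sketch (not part of the commit) mirroring the INSERT above:

```python
# Mirror of: INSERT INTO t1 SELECT i/100, i/500
#            FROM generate_series(1,1000000) s(i);
# (// is integer division, matching SQL's int / int)
N = 1_000_000
t1 = [(i // 100, i // 500) for i in range(1, N + 1)]

# b is fully determined by a: a = 1 means i in 100..199, hence b = 0.
actual = sum(1 for a, b in t1 if a == 1 and b == 0)   # 100 rows

# Independence assumption: sel(a = 1) * sel(b = 0) * N.
sel_a = sum(1 for a, _ in t1 if a == 1) / N           # 100 / N
sel_b = sum(1 for _, b in t1 if b == 0) / N           # 499 / N
naive = sel_a * sel_b * N                             # well under 1 row

print(actual, naive)
```

The naive product predicts less than one matching row, which is why the diff's comment says the row count "will be drastically underestimated" before the statistics object exists.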