Commit 87779aa4 authored by Dean Rasheed's avatar Dean Rasheed

Prevent functional dependency estimates from exceeding column estimates.

Formerly we applied a functional dependency "a => b with dependency
degree f" using the formula

  P(a,b) = P(a) * [f + (1-f)*P(b)]

This leads to the possibility that the combined selectivity P(a,b)
could exceed P(b), which is not ideal. The addition of support for IN
and OR clauses (commits 8f321bd1 and ccaa3569) would seem to make
this more likely, since the user-supplied values in such clauses are
not necessarily compatible with the functional dependency.

Mitigate this by using the formula

  P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)

instead, which guarantees that the combined selectivity is less than
each column's individual selectivity. Logically, this is modifies the
part of the formula that accounts for dependent rows to handle cases
where P(a) > P(b), whilst not changing the second term which accounts
for independent rows.

Additionally, this refactors the way that functional dependencies are
applied, so now dependencies_clauselist_selectivity() estimates both
the implying clauses and the implied clauses for each functional
dependency (formerly only the implied clauses were estimated), and now
all clauses for each attribute are taken into account (formerly only
one clause for each implied attribute was estimated). This removes the
previously built-in assumption that only equality clauses will be
seen, which is no longer true, and opens up the possibility of
applying functional dependencies to more general clauses.

Patch by me, reviewed by Tomas Vondra.

Discussion: https://postgr.es/m/CAEZATCXaNFZyOhR4XXAfkvj1tibRBEjje6ZbXwqWUB_tqbH%3Drw%40mail.gmail.com
Discussion: https://postgr.es/m/20200318002946.6dvblukm3cfmgir2%40development
parent 145cb16d
This diff is collapsed.
......@@ -440,6 +440,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
8 | 200
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ''1''');
estimated | actual
-----------+--------
4 | 100
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
estimated | actual
-----------+--------
......@@ -600,6 +606,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
200 | 200
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ''1''');
estimated | actual
-----------+--------
100 | 100
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
estimated | actual
-----------+--------
......@@ -719,12 +731,14 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
1 | 0
(1 row)
-- check change of column type doesn't break it
-- changing the type of column c causes its single-column stats to be dropped,
-- giving a default estimate of 0.005 * 5000 = 25 for (c = 1); check multiple
-- clauses estimated with functional dependencies does not exceed this
ALTER TABLE functional_dependencies ALTER COLUMN c TYPE numeric;
SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = 1 AND b = ''1'' AND c = 1');
estimated | actual
-----------+--------
50 | 50
25 | 50
(1 row)
ANALYZE functional_dependencies;
......
......@@ -280,6 +280,8 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b IN (''1'', ''2'')');
SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ''1''');
SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c IN (1)');
......@@ -342,6 +344,8 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b IN (''1'', ''2'')');
SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ''1''');
SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c IN (1)');
......@@ -385,7 +389,9 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ALL (ARRAY[''1'', ''2''])');
-- check change of column type doesn't break it
-- changing the type of column c causes its single-column stats to be dropped,
-- giving a default estimate of 0.005 * 5000 = 25 for (c = 1); check multiple
-- clauses estimated with functional dependencies does not exceed this
ALTER TABLE functional_dependencies ALTER COLUMN c TYPE numeric;
SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = 1 AND b = ''1'' AND c = 1');
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment