Commit be4b4dc7 authored by Tom Lane's avatar Tom Lane

Omit null rows when applying the Haas-Stokes estimator for ndistinct.

Previously, we included null rows in the values of n and N that went
into the formula, which amounts to considering null as a value in its
own right; but the d and f1 values do not include nulls.  This is
inconsistent, and it contributes to significant underestimation of
ndistinct when the column is mostly nulls.  In any case stadistinct
is defined as the number of distinct non-null values, so we should
exclude nulls when doing this computation.

This is an aboriginal bug in our application of the Haas-Stokes formula,
but we'll refrain from back-patching for fear of destabilizing plan
choices in released branches.

While at it, make the code a bit more readable by omitting unnecessary
casts and intermediate variables.

Observation and original patch by Tomas Vondra, adjusted to fix both
uses of the formula by Alex Shulgin, cosmetic improvements by me
parent 82c83b33
...@@ -2072,6 +2072,12 @@ compute_distinct_stats(VacAttrStatsP stats, ...@@ -2072,6 +2072,12 @@ compute_distinct_stats(VacAttrStatsP stats,
* recommend are considerably more complex, and are numerically * recommend are considerably more complex, and are numerically
* very unstable when n is much smaller than N. * very unstable when n is much smaller than N.
* *
* In this calculation, we consider only non-nulls. We used to
* include rows with null values in the n and N counts, but that
* leads to inaccurate answers in columns with many nulls, and
* it's intuitively bogus anyway considering the desired result is
* the number of distinct non-null values.
*
* We assume (not very reliably!) that all the multiply-occurring * We assume (not very reliably!) that all the multiply-occurring
* values are reflected in the final track[] list, and the other * values are reflected in the final track[] list, and the other
* nonnull values all appeared but once. (XXX this usually * nonnull values all appeared but once. (XXX this usually
...@@ -2081,21 +2087,22 @@ compute_distinct_stats(VacAttrStatsP stats, ...@@ -2081,21 +2087,22 @@ compute_distinct_stats(VacAttrStatsP stats,
*/ */
int f1 = nonnull_cnt - summultiple; int f1 = nonnull_cnt - summultiple;
int d = f1 + nmultiple; int d = f1 + nmultiple;
double numer, double n = samplerows - null_cnt;
denom, double N = totalrows * (1.0 - stats->stanullfrac);
stadistinct; double stadistinct;
numer = (double) samplerows *(double) d;
denom = (double) (samplerows - f1) + /* N == 0 shouldn't happen, but just in case ... */
(double) f1 *(double) samplerows / totalrows; if (N > 0)
stadistinct = (n * d) / ((n - f1) + f1 * n / N);
else
stadistinct = 0;
stadistinct = numer / denom;
/* Clamp to sane range in case of roundoff error */ /* Clamp to sane range in case of roundoff error */
if (stadistinct < (double) d) if (stadistinct < d)
stadistinct = (double) d; stadistinct = d;
if (stadistinct > totalrows) if (stadistinct > N)
stadistinct = totalrows; stadistinct = N;
/* And round to integer */
stats->stadistinct = floor(stadistinct + 0.5); stats->stadistinct = floor(stadistinct + 0.5);
} }
...@@ -2425,26 +2432,33 @@ compute_scalar_stats(VacAttrStatsP stats, ...@@ -2425,26 +2432,33 @@ compute_scalar_stats(VacAttrStatsP stats,
* recommend are considerably more complex, and are numerically * recommend are considerably more complex, and are numerically
* very unstable when n is much smaller than N. * very unstable when n is much smaller than N.
* *
* In this calculation, we consider only non-nulls. We used to
* include rows with null values in the n and N counts, but that
* leads to inaccurate answers in columns with many nulls, and
* it's intuitively bogus anyway considering the desired result is
* the number of distinct non-null values.
*
* Overwidth values are assumed to have been distinct. * Overwidth values are assumed to have been distinct.
*---------- *----------
*/ */
int f1 = ndistinct - nmultiple + toowide_cnt; int f1 = ndistinct - nmultiple + toowide_cnt;
int d = f1 + nmultiple; int d = f1 + nmultiple;
double numer, double n = samplerows - null_cnt;
denom, double N = totalrows * (1.0 - stats->stanullfrac);
stadistinct; double stadistinct;
numer = (double) samplerows *(double) d;
denom = (double) (samplerows - f1) + /* N == 0 shouldn't happen, but just in case ... */
(double) f1 *(double) samplerows / totalrows; if (N > 0)
stadistinct = (n * d) / ((n - f1) + f1 * n / N);
else
stadistinct = 0;
stadistinct = numer / denom;
/* Clamp to sane range in case of roundoff error */ /* Clamp to sane range in case of roundoff error */
if (stadistinct < (double) d) if (stadistinct < d)
stadistinct = (double) d; stadistinct = d;
if (stadistinct > totalrows) if (stadistinct > N)
stadistinct = totalrows; stadistinct = N;
/* And round to integer */
stats->stadistinct = floor(stadistinct + 0.5); stats->stadistinct = floor(stadistinct + 0.5);
} }
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment