Omit null rows when setting the threshold for what's a most-common value.

As with the previous patch, large numbers of null rows could skew this calculation unfavorably, causing us to discard values that have a legitimate claim to be MCVs, since our definition of MCV is that it's most common among the non-null population of the column. Hence, make the numerator of avgcount be the number of non-null sample values not the number of sample rows; likewise for maxmincount in the compute_scalar_stats variant.

Also, make the denominator be the number of distinct values actually observed in the sample, rather than reversing it back out of the computed stadistinct. This avoids depending on the accuracy of the Haas-Stokes approximation, and really it's what we want anyway; the threshold should depend only on what we see in the sample, not on what we extrapolate about the contents of the whole column.

Alex Shulgin, reviewed by Tomas Vondra and myself
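
The effect of changing the numerator can be seen in a small standalone sketch. The helper names below (mcv_mincount, old_mincount, new_mincount) are hypothetical and do not exist in analyze.c; they only mirror the avgcount and mincount arithmetic shown in the diff, and the old variant assumes the estimated ndistinct has already been converted to an absolute count.

```c
#include <assert.h>

/*
 * Illustrative only: mcv_mincount() is a hypothetical helper, not a
 * function in analyze.c.  It mirrors the threshold arithmetic in the diff.
 */
double
mcv_mincount(int numerator, double denominator)
{
	double		avgcount;
	double		mincount;

	assert(denominator > 0);

	/* estimate # occurrences in sample of a typical value */
	avgcount = (double) numerator / denominator;
	/* a value must beat the average by 25% and be seen at least twice */
	mincount = avgcount * 1.25;
	if (mincount < 2)
		mincount = 2;
	return mincount;
}

/* Old rule: the numerator counts every sample row, nulls included. */
double
old_mincount(int samplerows, double ndistinct)
{
	return mcv_mincount(samplerows, ndistinct);
}

/* New rule: the numerator counts only non-null sample values. */
double
new_mincount(int nonnull_cnt, int d)
{
	return mcv_mincount(nonnull_cnt, (double) d);
}
```

As a made-up example, take a sample of 30000 rows of which 27000 are null, with 100 distinct non-null values observed. If the old estimate of ndistinct also happened to be 100, old_mincount(30000, 100.0) gives a threshold of 375, while new_mincount(3000, 100) gives 37.5. A value seen 60 times, twice the non-null average of 30, would be discarded under the old rule and kept under the new one.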

Commit 3d3bf62f authored Apr 01, 2016 by Tom Lane

Showing 1 changed file with 9 additions and 11 deletions

src/backend/commands/analyze.c +9 -11

--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2133,14 +2133,13 @@ compute_distinct_stats(VacAttrStatsP stats,
 		}
 		else
 		{
-			double		ndistinct = stats->stadistinct;
+			/* d here is the same as d in the Haas-Stokes formula */
+			int			d = nonnull_cnt - summultiple + nmultiple;
 			double		avgcount,
 						mincount;
-			if (ndistinct < 0)
-				ndistinct = -ndistinct * totalrows;
-			/* estimate # of occurrences in sample of a typical value */
-			avgcount = (double) samplerows / ndistinct;
+			/* estimate # occurrences in sample of a typical nonnull value */
+			avgcount = (double) nonnull_cnt / (double) d;
 			/* set minimum threshold count to store a value */
 			mincount = avgcount * 1.25;
 			if (mincount < 2)
@@ -2494,21 +2493,20 @@ compute_scalar_stats(VacAttrStatsP stats,
 		}
 		else
 		{
-			double		ndistinct = stats->stadistinct;
+			/* d here is the same as d in the Haas-Stokes formula */
+			int			d = ndistinct + toowide_cnt;
 			double		avgcount,
 						mincount,
 						maxmincount;
-			if (ndistinct < 0)
-				ndistinct = -ndistinct * totalrows;
-			/* estimate # of occurrences in sample of a typical value */
-			avgcount = (double) samplerows / ndistinct;
+			/* estimate # occurrences in sample of a typical nonnull value */
+			avgcount = (double) values_cnt / (double) d;
 			/* set minimum threshold count to store a value */
 			mincount = avgcount * 1.25;
 			if (mincount < 2)
 				mincount = 2;
 			/* don't let threshold exceed 1/K, however */
-			maxmincount = (double) samplerows / (double) num_bins;
+			maxmincount = (double) values_cnt / (double) num_bins;
 			if (mincount > maxmincount)
 				mincount = maxmincount;
 			if (num_mcv > track_cnt)
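
The new denominator d and the 1/K cap can be sketched outside the backend as follows. Both helpers (distinct_in_sample, scalar_mincount) are hypothetical names, not functions in analyze.c, and the sketch assumes every distinct non-null value in the sample has its own entry in a counts array, which the real tracking code does not guarantee.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of analyze.c: derive d from per-value
 * occurrence counts using the same expression as the
 * compute_distinct_stats hunk.  counts[i] is how often the i-th distinct
 * non-null value was seen in the sample.
 */
int
distinct_in_sample(const int *counts, int ncounts)
{
	int			nonnull_cnt = 0;
	int			nmultiple = 0;		/* # of values seen more than once */
	int			summultiple = 0;	/* total occurrences of those values */

	for (int i = 0; i < ncounts; i++)
	{
		nonnull_cnt += counts[i];
		if (counts[i] > 1)
		{
			nmultiple++;
			summultiple += counts[i];
		}
	}

	/* values seen exactly once, plus values seen more than once */
	return nonnull_cnt - summultiple + nmultiple;
}

/*
 * Hypothetical helper mirroring the compute_scalar_stats hunk: the same
 * threshold, but never above values_cnt / num_bins.
 */
double
scalar_mincount(int values_cnt, int d, int num_bins)
{
	double		avgcount,
				mincount,
				maxmincount;

	assert(d > 0 && num_bins > 0);

	avgcount = (double) values_cnt / (double) d;
	mincount = avgcount * 1.25;
	if (mincount < 2)
		mincount = 2;
	/* don't let threshold exceed 1/K of the non-null sample values */
	maxmincount = (double) values_cnt / (double) num_bins;
	if (mincount > maxmincount)
		mincount = maxmincount;
	return mincount;
}
```

For counts of {5, 1, 1, 3}, nonnull_cnt is 10, summultiple is 8 and nmultiple is 2, so d comes out as 4, the number of distinct values observed. With 3000 non-null values and d = 100, the threshold is capped at 30 when num_bins is 100 and stays at 37.5 when num_bins is 10.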