-
Dean Rasheed authored
Previously, a value was included in the MCV list if its frequency was 25% larger than the estimated average frequency of all nonnull values in the table. For uniform distributions, that can lead to values being included in the MCV list and significantly overestimated on the basis of relatively few (sometimes just 2) instances being seen in the sample. For non-uniform distributions, it can lead to too few values being included in the MCV list, since the overall average frequency may be dominated by a small number of very common values, while the remaining values may still have a large spread of frequencies, causing both substantial overestimation and underestimation of the remaining values. Furthermore, increasing the statistics target may have little effect because the overall average frequency will remain relatively unchanged. Instead, populate the MCV list with the largest set of common values that are statistically significantly more common than the average frequency of the remaining values. This takes into account the variance of the sample counts, which depends on the counts themselves and on the proportion of the table that was sampled. As a result, it constrains the relative standard error of estimates based on the frequencies of values in the list, reducing the chances of too many values being included. At the same time, it allows more values to be included, since the MCVs need only be more common than the remaining non-MCVs, rather than the overall average. Thus it tends to produce fewer MCVs than the previous code for uniform distributions, and more for non-uniform distributions, reducing estimation errors in both cases. In addition, the algorithm responds better to increasing the statistics target, allowing more values to be included in the MCV list when more of the table is sampled. Jeff Janes, substantially modified by me. Reviewed by John Naylor and Tomas Vondra. Discussion: https://postgr.es/m/CAMkU=1yvdGvW9TmiLAhz2erFnvnPFYHbOZuO+a=4DVkzpuQ2tw@mail.gmail.com
b5db1d93