    Omit null rows when setting the threshold for what's a most-common value. · 3d3bf62f
    Tom Lane authored
    As with the previous patch, large numbers of null rows could skew this
    calculation unfavorably, causing us to discard values that have a
    legitimate claim to be MCVs, since our definition of MCV is that it's
    most common among the non-null population of the column.  Hence, make
    the numerator of avgcount be the number of non-null sample values, not
    the number of sample rows; likewise for maxmincount in the
    compute_scalar_stats variant.
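
    The effect of the numerator change can be sketched as below. This is
    an illustration, not the analyze.c code: the function names are
    hypothetical, and the 1.25 multiplier, the floor of 2, and the
    per-bin cap are assumptions chosen to show the shape of the
    calculation.

```c
/*
 * Hypothetical helper: minimum number of sample occurrences a value needs
 * in order to be kept as an MCV.  The 1.25 factor and the floor of 2 are
 * assumed constants, used here for illustration only.
 */
double
mcv_min_count(int numerator_rows, int ndistinct)
{
    /* occurrences of a "typical" value in the sample */
    double      avgcount = (double) numerator_rows / (double) ndistinct;
    double      mincount = avgcount * 1.25;

    if (mincount < 2.0)
        mincount = 2.0;
    return mincount;
}

/*
 * Hypothetical scalar-stats variant: the threshold is additionally capped
 * at the number of non-null sample values per histogram bin (maxmincount).
 */
double
mcv_min_count_capped(int nonnull_cnt, int ndistinct, int num_bins)
{
    double      mincount = mcv_min_count(nonnull_cnt, ndistinct);
    double      maxmincount = (double) nonnull_cnt / (double) num_bins;

    if (mincount > maxmincount)
        mincount = maxmincount;
    return mincount;
}
```

    Under these assumptions, a 1000-row sample with 900 nulls and 10
    distinct non-null values gives a threshold of 125 when all sample
    rows are counted, but 12.5 when only the 100 non-null values are
    counted, so a value seen 40 times is no longer discarded.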
    
    Also, make the denominator be the number of distinct values actually
    observed in the sample, rather than reversing it back out of the computed
    stadistinct.  This avoids depending on the accuracy of the Haas-Stokes
    approximation, and really it's what we want anyway; the threshold should
    depend only on what we see in the sample, not on what we extrapolate
    about the contents of the whole column.
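
    The two choices of denominator can be contrasted with a small sketch.
    The function names are hypothetical, and the shape of the old
    calculation is an assumption; the only convention relied on is that a
    negative stadistinct is the negative of a multiplier on the table's
    row count.

```c
/*
 * Hypothetical helper: turn a stadistinct-style value back into an
 * estimated number of distinct values in the whole table.  A negative
 * value is treated as minus a fraction of the total row count.
 */
double
ndistinct_from_stadistinct(double stadistinct, double totalrows)
{
    if (stadistinct < 0)
        return -stadistinct * totalrows;
    return stadistinct;
}

/*
 * Assumed old approach: the denominator is extrapolated for the whole
 * table, so the result inherits any error in the distinct-count estimate.
 */
double
avgcount_extrapolated(int samplerows, double stadistinct, double totalrows)
{
    return (double) samplerows /
        ndistinct_from_stadistinct(stadistinct, totalrows);
}

/*
 * Approach described in this message: use only what was observed in the
 * sample, i.e. non-null values over distinct values actually seen.
 */
double
avgcount_sample(int nonnull_cnt, int ndistinct_sample)
{
    return (double) nonnull_cnt / (double) ndistinct_sample;
}
```

    For example, with 1000 non-null sample values and 50 distinct values
    observed, the sample-based average is 20 occurrences per value. If the
    extrapolated table-wide estimate were 250 distinct values, the old
    style of calculation would give 4 instead.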
    
    Alex Shulgin, reviewed by Tomas Vondra and myself
analyze.c 81.1 KB