• Dean Rasheed's avatar
    Improve ANALYZE's strategy for finding MCVs. · b5db1d93
    Dean Rasheed authored
    Previously, a value was included in the MCV list if its frequency was
    25% larger than the estimated average frequency of all nonnull values
    in the table.  For uniform distributions, that can lead to values
    being included in the MCV list and significantly overestimated on the
    basis of relatively few (sometimes just 2) instances being seen in the
    sample.  For non-uniform distributions, it can lead to too few values
    being included in the MCV list, since the overall average frequency
    may be dominated by a small number of very common values, while the
    remaining values may still have a large spread of frequencies, causing
    both substantial overestimation and underestimation of the remaining
    values.  Furthermore, increasing the statistics target may have little
    effect because the overall average frequency will remain relatively
    unchanged.
    
    Instead, populate the MCV list with the largest set of common values
    that are statistically significantly more common than the average
    frequency of the remaining values.  This takes into account the
    variance of the sample counts, which depends on the counts themselves
    and on the proportion of the table that was sampled.  As a result, it
    constrains the relative standard error of estimates based on the
    frequencies of values in the list, reducing the chances of too many
    values being included.  At the same time, it allows more values to be
    included, since the MCVs need only be more common than the remaining
    non-MCVs, rather than the overall average.  Thus it tends to produce
    fewer MCVs than the previous code for uniform distributions, and more
    for non-uniform distributions, reducing estimation errors in both
    cases.  In addition, the algorithm responds better to increasing the
    statistics target, allowing more values to be included in the MCV list
    when more of the table is sampled.
    
    Jeff Janes, substantially modified by me. Reviewed by John Naylor and
    Tomas Vondra.
    
    Discussion: https://postgr.es/m/CAMkU=1yvdGvW9TmiLAhz2erFnvnPFYHbOZuO+a=4DVkzpuQ2tw@mail.gmail.com
    b5db1d93
analyze.c 88.2 KB