    Omit null rows when applying the Haas-Stokes estimator for ndistinct. · be4b4dc7
    Tom Lane authored
    Previously, we included null rows in the values of n and N that went
    into the formula, which amounts to considering null as a value in its
    own right; but the d and f1 values do not include nulls.  This is
    inconsistent, and it contributes to significant underestimation of
    ndistinct when the column is mostly nulls.  In any case stadistinct
    is defined as the number of distinct non-null values, so we should
    exclude nulls when doing this computation.
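
To illustrate the paragraph above, here is a minimal sketch of the Haas-Stokes estimator, n*d / (n - f1 + f1*n/N), written as a standalone function. It is not the code in analyze.c: the function name, the zero-sample guard, and the final clamping are assumptions made for illustration. Here n is the number of sampled rows, N the estimated total number of rows, d the number of distinct values in the sample, and f1 the number of values seen exactly once.

```c
/*
 * Hypothetical helper illustrating the Haas-Stokes estimator:
 *     n*d / (n - f1 + f1*n/N)
 * Per the commit message, n and N should count non-null rows only,
 * because d and f1 never include nulls.
 */
double
haas_stokes_ndistinct(double n, double N, double d, double f1)
{
    double numer;
    double denom;
    double est;

    /* Assumed guard: with no sampled rows there is nothing to estimate. */
    if (n <= 0 || N <= 0)
        return 0.0;

    numer = n * d;
    denom = (n - f1) + f1 * n / N;
    est = numer / denom;

    /* Assumed sanity clamp: keep the estimate between d and N. */
    if (est < d)
        est = d;
    if (est > N)
        est = N;

    return est;
}
```

As a made-up example, take a table of 100000 rows that is 90% null, with a 1000-row sample in which the 100 non-null rows hold d = 80 distinct values and f1 = 70 singletons. Passing the null-inclusive counts (n = 1000, N = 100000) gives roughly 86, while passing the non-null counts (n = 100, N = 10000) gives roughly 261, which shows the direction of the underestimation described above.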
    
    This is an aboriginal bug in our application of the Haas-Stokes formula,
    but we'll refrain from back-patching for fear of destabilizing plan
    choices in released branches.
    
    While at it, make the code a bit more readable by omitting unnecessary
    casts and intermediate variables.
    
    Observation and original patch by Tomas Vondra, adjusted to fix both
    uses of the formula by Alex Shulgin, cosmetic improvements by me.
analyze.c 81.1 KB