• Tom Lane's avatar
    Fix misestimation of n_distinct for a nearly-unique column with many nulls. · 95bee941
    Tom Lane authored
    If ANALYZE found no repeated non-null entries in its sample, it set the
    column's stadistinct value to -1.0, intending to indicate that the entries
    are all distinct.  But what this value actually means is that the number
    of distinct values is 100% of the table's rowcount, and thus it was
    overestimating the number of distinct values by however many nulls there
    are.  This could lead to very poor selectivity estimates, as for example
    in a recent report from Andreas Joseph Krogh.  We should discount the
    stadistinct value by whatever we've estimated the nulls fraction to be.
    (That is what will happen if we choose to use a negative stadistinct for
    a column that does have repeated entries, so this code path was just
    inconsistent.)
    
    In addition to fixing the stadistinct entries stored by several different
    ANALYZE code paths, adjust the logic where get_variable_numdistinct()
    forces an "all distinct" estimate on the basis of finding a relevant unique
    index.  Unique indexes don't reject nulls, so there's no reason to assume
    that the null fraction doesn't apply.
    
    Back-patch to all supported branches.  Back-patching is a bit of a judgment
    call, but this problem seems to affect only a few users (else we'd have
    identified it long ago), and it's bad enough when it does happen that
    destabilizing plan choices in a worse direction seems unlikely.
    
    Patch by me, with documentation wording suggested by Dean Rasheed
    
    Report: <VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena>
    Discussion: <16143.1470350371@sss.pgh.pa.us>
    95bee941
ts_typanalyze.c 17 KB