    Fix misuse of Lossy Counting (LC) algorithm in compute_tsvector_stats(). · bc0f0809
    Tom Lane authored
    We must filter out hashtable entries with frequencies less than those
    specified by the algorithm, else we risk emitting junk entries whose
    actual frequency is much less than other lexemes that did not get
    tabulated.  This is bad enough by itself, but even worse is that
    tsquerysel() believes that the minimum frequency seen in pg_statistic is a
    hard upper bound for lexemes not included, and was thus underestimating
    the frequency of non-MCEs.
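
The filtering step described above can be illustrated with a minimal, self-contained sketch of textbook Lossy Counting. This is not the code in compute_tsvector_stats(): the names (LcTable, lc_add, lc_result), the fixed-size array standing in for the hashtable, and the use of plain integers in place of lexemes are all assumptions made for illustration. The point it shows is that entries surviving the periodic pruning are not automatically trustworthy, so the result must still be filtered against the cutoff (s - epsilon) * N.

```c
#include <assert.h>
#include <stddef.h>

#define LC_MAX_ENTRIES 256      /* hypothetical fixed capacity for this sketch */

typedef struct {
    int item;                   /* tracked value (stands in for a lexeme) */
    int f;                      /* occurrences counted since insertion */
    int delta;                  /* max possible undercount: b_current - 1 at insertion */
} LcEntry;

typedef struct {
    LcEntry entries[LC_MAX_ENTRIES];
    size_t  n_entries;
    int     bucket_width;       /* w = ceil(1 / epsilon) */
    int     b_current;          /* current bucket number, starts at 1 */
    long    n_seen;             /* N, total items processed so far */
} LcTable;

void lc_init(LcTable *t, int bucket_width)
{
    assert(bucket_width > 0);
    t->n_entries = 0;
    t->bucket_width = bucket_width;
    t->b_current = 1;
    t->n_seen = 0;
}

/* At a bucket boundary, drop every entry with f + delta <= b_current. */
void lc_prune(LcTable *t)
{
    size_t keep = 0;

    for (size_t i = 0; i < t->n_entries; i++)
        if (t->entries[i].f + t->entries[i].delta > t->b_current)
            t->entries[keep++] = t->entries[i];
    t->n_entries = keep;
}

/* Process one item from the stream. */
void lc_add(LcTable *t, int item)
{
    size_t i;

    for (i = 0; i < t->n_entries; i++)
        if (t->entries[i].item == item)
            break;

    if (i < t->n_entries)
        t->entries[i].f++;
    else
    {
        assert(t->n_entries < LC_MAX_ENTRIES);
        t->entries[i].item = item;
        t->entries[i].f = 1;
        t->entries[i].delta = t->b_current - 1;
        t->n_entries++;
    }

    t->n_seen++;
    if (t->n_seen % t->bucket_width == 0)
    {
        lc_prune(t);
        t->b_current++;
    }
}

/*
 * Final filter: copy to out[] only entries with f >= (s - epsilon) * N.
 * Entries inserted after the last pruning pass are still in the table but
 * fall below the cutoff, so skipping this step would report junk.
 */
size_t lc_result(const LcTable *t, double s, double epsilon, LcEntry *out)
{
    double cutoff = (s - epsilon) * (double) t->n_seen;
    size_t n = 0;

    for (size_t i = 0; i < t->n_entries; i++)
        if ((double) t->entries[i].f >= cutoff)
            out[n++] = t->entries[i];
    return n;
}
```

Feeding this a stream in which one value alternates with a series of distinct one-off values leaves the most recent one-off values in the table, since no pruning pass has reached them yet. Only the lc_result() cutoff keeps them out of the output.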
    
    Also, set the threshold frequency to something with a little bit of theory
    behind it, to wit, assume that the input distribution is approximately
    Zipfian.  This might need adjustment in future, but some preliminary
    experiments suggest that it's not too unreasonable.
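
As a rough illustration of what a Zipfian assumption implies for a threshold frequency, the sketch below computes the expected relative frequency of the K-th most common word under Zipf's law with exponent 1, namely 1 / (K * H_W), where H_W is the W-th harmonic number and W is the vocabulary size. The function names and the choice of W are hypothetical and chosen for illustration only. The commit message does not state the exact formula or constants the patch uses.

```c
#include <assert.h>

/* H_W = 1 + 1/2 + ... + 1/W, the W-th harmonic number. */
double harmonic_number(long w)
{
    double h = 0.0;

    assert(w >= 1);
    for (long i = w; i >= 1; i--)   /* add the small terms first for accuracy */
        h += 1.0 / (double) i;
    return h;
}

/*
 * Expected relative frequency of the rank-k word in a vocabulary of w
 * words, assuming Zipf's law with exponent 1 (an assumption of this sketch).
 */
double zipf_rank_frequency(long k, long w)
{
    assert(k >= 1 && k <= w);
    return 1.0 / ((double) k * harmonic_number(w));
}
```

With an illustrative vocabulary of one million words, the rank-1 frequency comes out at roughly 0.07, so the rank-K frequency is roughly 0.07/K. A threshold s could then be tied to the number of entries one wants to keep.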
    
    Back-patch to 8.4, where this code was introduced.
    
    Jan Urbanski, with some editorialization by Tom
ts_typanalyze.c 16.8 KB