    Fix default text search parser's ts_headline code for phrase queries. · c9b0c678
    Tom Lane authored
    This code could produce very poor results when asked to highlight a
    string based on a query using phrase-match operators.  The root cause
    is that hlCover(), which is supposed to find a minimal substring that
    matches the query, was written assuming that word position is not
    significant.  I'm only 95% convinced that its algorithm was correct even
    for plain AND/OR queries; but it definitely fails for phrase matches,
    where it may not identify a cover string at all.
    
    Hence, rewrite hlCover() with a less-tense algorithm that just tries
    all the possible substrings, earlier and shorter ones first.  (This is
    not as bad as it sounds performance-wise, because all of the string
    matching has been done already: the repeated tsquery match checks
    boil down to pointer comparisons.)
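
The search order described above can be sketched roughly as follows. This is an illustration only, not the code in wparser_def.c: the Word struct, phrase_matches() and find_cover() are hypothetical names, and the query is hard-wired to a two-term phrase ("term 0 immediately followed by term 1") so that the match check reduces to integer comparisons.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a parsed headline word: the index of the
 * query term it matched, or -1 if it is not a query word. */
typedef struct
{
    int term;
} Word;

/* Hypothetical match check for the phrase "term 0 <-> term 1": true if
 * words[lo..hi] contains term 0 immediately followed by term 1. */
static bool
phrase_matches(const Word *words, size_t lo, size_t hi)
{
    for (size_t i = lo; i < hi; i++)
    {
        if (words[i].term == 0 && words[i + 1].term == 1)
            return true;
    }
    return false;
}

/* Brute-force cover search: consider every substring that starts and
 * ends on a query word, trying earlier start points first and, for each
 * start point, shorter substrings first.  The first substring that
 * satisfies the query is reported through the lo and hi outputs. */
static bool
find_cover(const Word *words, size_t n, size_t from,
           size_t *lo, size_t *hi)
{
    for (size_t pmin = from; pmin < n; pmin++)
    {
        if (words[pmin].term < 0)
            continue;           /* a cover must start on a query word */
        for (size_t pmax = pmin; pmax < n; pmax++)
        {
            if (words[pmax].term < 0)
                continue;       /* ... and end on one */
            if (phrase_matches(words, pmin, pmax))
            {
                *lo = pmin;
                *hi = pmax;
                return true;
            }
        }
    }
    return false;
}
```

In this sketch the first hit is not necessarily the shortest cover overall; calling find_cover() again with a later starting point yields further candidates, and choosing among them is a separate step.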
    
    Unfortunately, since that approach produces more candidate cover
    strings than before, it also exposes that there were bugs in the
    heuristics in mark_hl_words() for selecting a best cover string.
    Fixes there include:
    * Do not apply the ShortWord filter to words that appear in the query.
    * Remove a misguided optimization for quickly rejecting a cover.
    * Fix order-of-operation bug that could cause computation of a
      wrong figure of merit (poslen) when shortening a cover.
    * Change the preference rule so that candidate headlines that do not
      include their whole cover string (after MaxWords trimming) are lowest
      priority, since they may not actually satisfy the user's query.
    
    This results in some changes in existing regression test cases,
    but they all seem reasonable.  Note in particular that the tests
    involving strings like "1 2 3" were previously being affected by
    the ShortWord filter, masking the normal matching behavior.
    
    Per bug #16345 from Augustinas Jokubauskas; the new test cases are
    based on that example.  Back-patch to 9.6 where phrase search was
    added to tsquery.
    
    Discussion: https://postgr.es/m/16345-2e0cf5cddbdcd3b4@postgresql.org
wparser_def.c 76.8 KB