    Fix serious performance problems in json(b) to_tsvector(). · b4c6d31c
    Tom Lane authored
    In an off-list followup to bug #14745, Bob Jones complained that
    to_tsvector() on a 2MB jsonb value took an unreasonable amount of
    time and space --- enough to draw the wrath of the OOM killer on
    his machine.  On my machine, his example proved to require upwards
    of 18 seconds and 4GB, which seemed pretty bogus considering that
    to_tsvector() on the same data treated as text took just a couple
    hundred msec and 10 or so MB.
    
    On investigation, the problem is that the implementation scans each
    string element of the json(b) and converts it to tsvector separately,
    then applies tsvector_concat() to join those separate tsvectors.
    The unreasonable memory usage came from leaking every single one of
    the transient tsvectors --- but even without that mistake, this is an
    O(N^2) or worse algorithm, because tsvector_concat() has to repeatedly
    process the words coming from earlier elements.
    
    We can fix it by accumulating all the lexeme data and applying
    make_tsvector() just once.  As a side benefit, that also makes the
    desired adjustment of lexeme positions far cheaper, because we can
    just tweak the running "pos" counter between JSON elements.
    
    In passing, try to make the explanation of that tweak more intelligible.
    (I didn't think that a barely-readable comment far removed from the
    actual code was helpful.)  And do some minor other code beautification.