• Tom Lane's avatar
    Avoid generating extra subre tree nodes for capturing parentheses. · ea1268f6
    Tom Lane authored
    Previously, each pair of capturing parentheses gave rise to a separate
    subre tree node, whose only function was to identify that we ought to
    capture the match details for this particular sub-expression.  In
    most cases we don't really need that, since we can perfectly well
    put a "capture this" annotation on the child node that does the real
    matching work.  As with the two preceding commits, the main value
    of this is to avoid generating and optimizing an NFA for a tree node
    that's not really pulling its weight.
    
    The chosen data representation only allows one capture annotation
    per subre node.  In the legal-per-spec, but seemingly not very useful,
    case where there are multiple capturing parens around the exact same
    bit of the regex (i.e. "((xyz))"), wrap the child node in N-1 capture
    nodes that act the same as before.  We could work harder at that but
    I'll refrain, pending some evidence that such cases are worth troubling
    over.
    
    In passing, improve the comments in regex.h to say what all the
    different re_info bits mean.  Some of them were pretty obvious
    but others not so much, so reverse-engineer some documentation.
    
    This is part of a patch series that in total reduces the regex engine's
    runtime by about a factor of four on a large corpus of real-world regexes.
    
    Patch by me, reviewed by Joel Jacobson
    
    Discussion: https://postgr.es/m/1340281.1613018383@sss.pgh.pa.us
    ea1268f6
regexec.c 36.4 KB