    Implement lookbehind constraints in our regular-expression engine. · 12c9a040
    Tom Lane authored
    A lookbehind constraint is like a lookahead constraint in that it consumes
    no text; but it checks for existence (or nonexistence) of a match *ending*
    at the current point in the string, rather than one *starting* at the
    current point.  This is a long-requested feature: it exists in many other
    regex libraries, but Henry Spencer never got around to implementing it in
    the code we use.
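    To illustrate the semantics (using Python's re module, which also supports
    lookarounds, purely as a stand-in for the Postgres engine):

```python
import re

# Lookahead (?=...) / (?!...) checks for a match *starting* at the current
# point; lookbehind (?<=...) / (?<!...) checks for one *ending* there.
# Neither form consumes any text.

# Positive lookbehind: match "bar" only when immediately preceded by "foo".
print(re.findall(r'(?<=foo)bar', 'foobar bazbar'))  # ['bar']

# Negative lookbehind: match "bar" only when NOT preceded by "foo"
# (here, only the "bar" after "baz" qualifies).
print(re.findall(r'(?<!foo)bar', 'foobar bazbar'))  # ['bar']

# Lookahead, for comparison: match "foo" only when followed by "bar".
print(re.findall(r'foo(?=bar)', 'foobar foobaz'))   # ['foo']
```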
    
    Just making it work is actually pretty trivial; but naive copying of the
    logic for lookahead constraints leads to code that often spends O(N^2) time
    to scan an N-character string, because we have to run the match engine
    from string start to the current probe point each time the constraint is
    checked.  In typical use-cases a lookbehind constraint will be written at
    the start of the regex and hence will need to be checked at every character
    --- so O(N^2) work overall.  To fix that, I introduced a third copy of the
    core DFA matching loop, paralleling the existing longest() and shortest()
    loops.  This version, matchuntil(), can suspend and resume matching given
    a couple of pointers' worth of storage space.  So we need only run it
    across the string once, stopping at each interesting probe point and then
    resuming to advance to the next one.
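    The difference between the naive approach and a suspend-and-resume scan can
    be sketched in Python over a toy DFA (a hypothetical simplification for
    illustration, not the actual matchuntil() code):

```python
# Toy DFA: accepting state 2 means "a match of 'ab' ends at this point".
def step(state, ch):
    if ch == 'a':
        return 1
    if ch == 'b' and state == 1:
        return 2
    return 0

def naive_check(dfa_step, start_state, accept, text, probe):
    # Re-run the automaton from the string start up to `probe` every time:
    # O(probe) work per call, hence O(N^2) when probed at every position.
    state = start_state
    for ch in text[:probe]:
        state = dfa_step(state, ch)
    return state in accept

class ResumableScan:
    # Keep the current state and position between calls (a couple of
    # pointers' worth of storage), so advancing to the next probe point
    # is incremental and the whole string is scanned only once.
    def __init__(self, dfa_step, start_state, accept):
        self.step, self.state, self.accept = dfa_step, start_state, accept
        self.pos = 0

    def check_at(self, text, probe):
        while self.pos < probe:          # resume from the last stop
            self.state = self.step(self.state, text[self.pos])
            self.pos += 1
        return self.state in self.accept

text = 'xabyab'
scan = ResumableScan(step, 0, {2})
naive = [naive_check(step, 0, {2}, text, p) for p in range(1, len(text) + 1)]
resumable = [scan.check_at(text, p) for p in range(1, len(text) + 1)]
print(naive == resumable)  # True: same answers, but one pass vs. N passes
```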
    
    I also put in an optimization that simplifies one-character lookahead and
    lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND
    constraints, which already existed in the engine.  This avoids the overhead
    of the LACON machinery entirely for these rather common cases.
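    The point of that simplification is that a one-character lookaround only
    needs to test the single adjacent character against a character set, with
    no sub-regex machinery at all.  A rough sketch (the function names here
    are illustrative stand-ins, not the engine's actual AHEAD/BEHIND
    representation):

```python
import string

# A one-character constraint reduces to: is the adjacent character (not)
# in this set?  No sub-regex (LACON) evaluation is required.

def ahead(charset, negate=False):
    # Like "(?=x)" / "(?!x)": test the character at the current position.
    def check(text, pos):
        ok = pos < len(text) and text[pos] in charset
        return ok != negate
    return check

def behind(charset, negate=False):
    # Like "(?<=x)" / "(?<!x)": test the character just before it.
    def check(text, pos):
        ok = pos > 0 and text[pos - 1] in charset
        return ok != negate
    return check

word_chars = set(string.ascii_letters + string.digits + '_')
not_word_behind = behind(word_chars, negate=True)   # behaves like "(?<!\w)"
print(not_word_behind('foo bar', 4))  # True: preceded by a space
print(not_word_behind('foo bar', 1))  # False: preceded by 'f'
```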
    
    The net result is that lookbehind constraints run a factor of three or so
    slower than Perl's for multi-character constraints, but faster than Perl's
    for one-character constraints ... and they work fine for variable-length
    constraints, which Perl gives up on entirely.  So that's not bad from a
    competitive perspective, and there's room for further optimization if
    anyone cares.  (In reality, raw scan rate across a large input string is
    probably not that big a deal for Postgres usage anyway; so I'm happy if
    it's linear.)