• Tom Lane's avatar
    Recognize "match-all" NFAs within the regex engine. · 824bf719
    Tom Lane authored
    This builds on the previous "rainbow" patch to detect NFAs that will
    match any string, though possibly with constraints on the string length.
    This definition is chosen to match constructs such as ".*", ".+", and
    ".{1,100}".  Recognizing such an NFA after the optimization pass is
    fairly cheap, since we basically just have to verify that all arcs
    are RAINBOW arcs and count the number of steps to the end state.
    (Well, there's a bit of complication with pseudo-color arcs for string
    boundary conditions, but not much.)
    
    Once we have these markings, the regex executor functions longest(),
    shortest(), and matchuntil() don't have to expend per-character work
    to determine whether a given substring satisfies such an NFA; they
    just need to check its length against the bounds.  Since some matching
    problems require O(N) invocations of these functions, we've reduced
    the runtime for an N-character string from O(N^2) to O(N).  Of course,
    this is no help for non-matchall sub-patterns, but those usually have
    constraints that allow us to avoid needing O(N) substring checks in the
    first place.  It's precisely the unconstrained "match-all" cases that
    cause the most headaches.
    
    This is part of a patch series that in total reduces the regex engine's
    runtime by about a factor of four on a large corpus of real-world regexes.
    
    Patch by me, reviewed by Joel Jacobson
    
    Discussion: https://postgr.es/m/1340281.1613018383@sss.pgh.pa.us
    824bf719
regprefix.c 7.21 KB