• Tom Lane's avatar
    Fix semantics of regular expression back-references. · 4aea704a
    Tom Lane authored
    POSIX defines the behavior of back-references thus:
    
        The back-reference expression '\n' shall match the same (possibly
        empty) string of characters as was matched by a subexpression
        enclosed between "\(" and "\)" preceding the '\n'.
    
    As far as I can see, the back-reference is supposed to consider only
    the data characters matched by the referenced subexpression.  However,
    because our engine copies the NFA constructed from the referenced
    subexpression, it effectively enforces any constraints therein, too.
    As an example, '(^.)\1' ought to match 'xx', or any other string
    starting with two occurrences of the same character; but in our code
    it does not, and indeed can't match anything, because the '^' anchor
    constraint is included in the backref's copied NFA.  If POSIX intended
    that, you'd think they'd mention it.  Perl for one doesn't act that
    way, so it's hard to conclude that this isn't a bug.
    
    Fix by modifying the backref's NFA immediately after it's copied from
    the reference, replacing all constraint arcs by EMPTY arcs so that the
    constraints are treated as automatically satisfied.  This still allows
    us to enforce matching rules that depend only on the data characters;
    for example, in '(^\d+).*\1' the NFA matching step will still know
    that the backref can only match strings of digits.
    
    Perhaps surprisingly, this change does not affect the results of any
    of a rather large corpus of real-world regexes.  Nonetheless, I would
    not consider back-patching it, since it's a clear compatibility break.
    
    Patch by me, reviewed by Joel Jacobson
    
    Discussion: https://postgr.es/m/661609.1614560029@sss.pgh.pa.us
    4aea704a
func.sgml 989 KB