• Tom Lane's avatar
    Fix regex back-references that are directly quantified with *. · 5223f96d
    Tom Lane authored
    The syntax "\n*", that is a backref with a * quantifier directly applied
    to it, has never worked correctly in Spencer's library.  This has been an
    open bug in the Tcl bug tracker since 2005:
    https://sourceforge.net/tracker/index.php?func=detail&aid=1115587&group_id=10894&atid=110894
    
    The core of the problem is in parseqatom(), which first changes "\n*" to
    "\n+|" and then applies repeat() to the NFA representing the backref atom.
    repeat() thinks that any arc leading into its "rp" argument is part of the
    sub-NFA to be repeated.  Unfortunately, since parseqatom() already created
    the arc that was intended to represent the empty bypass around "\n+", this
    arc gets moved too, so that it now leads into the state loop created by
    repeat().  Thus, what was supposed to be an "empty" bypass gets turned into
    something that represents zero or more repetitions of the NFA representing
    the backref atom.  In the original example, in place of
    	^([bc])\1*$
    we now have something that acts like
    	^([bc])(\1+|[bc]*)$
    At runtime, the branch involving the actual backref fails, as it's supposed
    to, but then the other branch succeeds anyway.
    
    We could no doubt fix this by some rearrangement of the operations in
    parseqatom(), but that code is plenty ugly already, and what's more the
    whole business of converting "x*" to "x+|" probably needs to go away to fix
    another problem I'll mention in a moment.  Instead, this patch suppresses
    the *-conversion when the target is a simple backref atom, leaving the case
    of m == 0 to be handled at runtime.  This makes the patch in regcomp.c a
    one-liner, at the cost of having to tweak cbrdissect() a little.  In the
    event I went a bit further than that and rewrote cbrdissect() to check all
    the string-length-related conditions before it starts comparing characters.
    It seems a bit stupid to possibly iterate through many copies of an
    n-character backreference, only to fail at the end because the target
    string's length isn't a multiple of n --- we could have found that out
    before starting.  The existing coding could only be a win if integer
    division is hugely expensive compared to character comparison, but I don't
    know of any modern machine where that might be true.
    
    This does not fix all the problems with quantified back-references.  In
    particular, the code is still broken for back-references that appear within
    a larger expression that is quantified (so that direct insertion of the
    quantification limits into the BACKREF node doesn't apply).  I think fixing
    that will take some major surgery on the NFA code, specifically introducing
    an explicit iteration node type instead of trying to transform iteration
    into concatenation of modified regexps.
    
    Back-patch to all supported branches.  In HEAD, also add a regression test
    case for this.  (It may seem a bit silly to create a regression test file
    for just one test case; but I'm expecting that we will soon import a whole
    bunch of regex regression tests from Tcl, so might as well create the
    infrastructure now.)
    5223f96d
regcomp.c 49.9 KB