src/backend/regex/regcomp.c · 5e6ad63c6db79a5515e7a4117b5de053530fd7d9 · Abuhujair Javed / Postgres FD Implementation

Make regexp engine's backref-related compilation state more bulletproof. · 5e6ad63c

Tom Lane authored Aug 07, 2021

Up to now, we remembered the definition of a capturing parenthesis
subexpression by storing a pointer to the associated subRE node.
That was okay before, because that subRE didn't get modified anymore
while parsing the rest of the regexp. However, in the wake of
commit ea1268f6, that's no longer true: the outer invocation of
parseqatom() feels free to scribble on that subRE. This seems to
work anyway, because the states we jam into the child atom in the
"prepare a general-purpose state skeleton" stanza aren't really
semantically different from the original endpoints of the child atom.
But that would be mighty easy to break, and it's definitely not how
things worked before.

Between this and the issue fixed in the prior commit, it seems best
to get rid of this dependence on subRE nodes entirely. We don't need
the whole child subRE for future backrefs, only its starting and ending
NFA states; so let's just store pointers to those.

Also, in the corner case where we make an extra subRE to handle
immediately-nested capturing parentheses, it seems like it'd be smart
to have the extra subRE have the same begin/end states as the original
child subRE does (s/s2 not lp/rp). I think that linking it from lp to
rp might actually be semantically wrong, though since Spencer's original
code did it that way, I'm not totally certain. Using s/s2 is certainly
not wrong, in any case.

Per report from Mark Dilger. Back-patch to v14 where the problematic
patches came in.

Discussion: https://postgr.es/m/0203588E-E609-43AF-9F4F-902854231EE7@enterprisedb.com

5e6ad63c

regcomp.c 68.1 KB

Replace regcomp.c