    Fix SQL-style substring() to have spec-compliant greediness behavior. · 7c850320
    Tom Lane authored
    SQL's regular-expression substring() function is defined to have a
    pattern argument that's separated into three subpatterns by escape-
    double-quote markers; the function result is the part of the input
    matching the second subpattern.  The standard makes it clear that
    if there is ambiguity about how to match the input to the subpatterns,
    the first and third subpatterns should be taken to match the smallest
    possible amount of text (i.e., they're "non greedy", in the terms of
    our regex code).  We were not doing it that way: the first subpattern
    would eat the largest possible amount of text, causing the function
    result to be shorter than what the spec requires.
    
    Fix that by attaching explicit greediness quantifiers to the
    subpatterns.  (This depends on the regex fix in commit 8a29ed05;
    before that, this didn't reliably change the regex engine's behavior.)
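
    As a rough illustration of the greediness difference, here is a minimal
    Python sketch.  It is not the PostgreSQL implementation: the helper
    names (to_regex, sql_substring) and the spec_compliant flag are
    hypothetical, only the % and _ wildcards are translated, and it assumes
    exactly two markers and no escaped escape characters.  Python attaches
    laziness to individual quantifiers, so the sketch makes each wildcard
    in the first and third subpatterns lazy instead of quantifying the
    whole subpattern.

```python
import re


def to_regex(subpattern, lazy):
    """Translate a SIMILAR TO-style subpattern (only % and _) to a Python regex."""
    out = []
    for ch in subpattern:
        if ch == "%":
            # % matches any run of characters; lazy picks the shortest run
            out.append(".*?" if lazy else ".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "".join(out)


def sql_substring(text, pattern, escape="#", spec_compliant=True):
    """Return the text matched by the second subpattern, or None if no match."""
    # Simplification: assumes exactly two escape-double-quote markers.
    first, second, third = pattern.split(escape + '"')
    regex = "^(?:%s)(%s)(?:%s)$" % (
        to_regex(first, spec_compliant),   # first subpattern: smallest match
        to_regex(second, False),           # second subpattern stays greedy
        to_regex(third, spec_compliant),   # third subpattern: smallest match
    )
    m = re.match(regex, text, re.DOTALL)
    return m.group(1) if m else None


print(sql_substring("aXbXc", '%#"X%#"%'))                        # XbXc
print(sql_substring("aXbXc", '%#"X%#"%', spec_compliant=False))  # Xc
```

    With a greedy first subpattern the result shrinks to "Xc"; with lazy
    first and third subpatterns the second one captures "XbXc".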
    
    Also, by adding parentheses around each subpattern, we ensure that
    "|" (OR) in the subpatterns behaves sanely.  Previously, "|" in the
    first or third subpatterns didn't work.
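
    The effect of the parentheses can be seen in any regex engine, since
    "|" binds more loosely than concatenation.  The Python sketch below
    only illustrates that precedence rule; the example subpatterns are made
    up, and the ungrouped form is not claimed to be the exact regex the old
    code generated.

```python
import re

# Hypothetical subpatterns, already in regex syntax
first, second, third = "a|b", "c", "d|e"

# Plain concatenation: parsed as three alternatives, ^a | b(c)d | e$
ungrouped = "^" + first + "(" + second + ")" + third + "$"

# Each subpattern wrapped in its own group, so "|" stays local to it
grouped = "^(?:%s)(%s)(?:%s)$" % (first, second, third)

m_ungrouped = re.match(ungrouped, "acd")
m_grouped = re.match(grouped, "acd")

print(m_ungrouped.group(1))  # None: only the "^a" alternative matched
print(m_grouped.group(1))    # c
```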
    
    This patch also makes the function throw an error if you write more
    than two escape-double-quote markers, and do something sane if you write
    just one, and it documents that behavior.  Previously, an odd number of
    markers led to a confusing complaint about unbalanced parentheses,
    while extra pairs of markers were just ignored.  (Note that the spec
    requires exactly two markers, but we've historically allowed there
    to be none, and this patch preserves the old behavior for that case.)
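
    The marker-count rules can be sketched as follows.  This is a
    hypothetical Python helper (split_subpatterns), not the patch's C code.
    How the one-marker and zero-marker cases are filled in here (missing
    subpatterns treated as empty) is an assumption; the text above only
    says that those cases are accepted.  The split also ignores the
    possibility of an escaped escape character before a double quote.

```python
def split_subpatterns(pattern, escape="#"):
    """Split a pattern into (first, second, third) at escape-double-quote markers."""
    parts = pattern.split(escape + '"')
    markers = len(parts) - 1
    if markers > 2:
        # More than two markers is rejected outright
        raise ValueError("more than two escape-double-quote separators")
    if markers == 2:
        return (parts[0], parts[1], parts[2])
    if markers == 1:
        # Assumption: a single marker leaves the third subpattern empty
        return (parts[0], parts[1], "")
    # Assumption: with no markers the whole pattern is the second subpattern
    return ("", parts[0], "")


print(split_subpatterns('%#"o_b#"%'))  # ('%', 'o_b', '%')
print(split_subpatterns('%#"o_b'))     # ('%', 'o_b', '')
```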
    
    In passing, adjust some substring() test cases that didn't really
    prove what they said they were testing for: they used patterns
    that didn't match the data string, so that the output would be
    NULL whether or not the function was really strict.
    
    Although this is certainly a bug fix, changing the behavior in back
    branches seems undesirable: applications could perhaps be depending on
    the old behavior, since it's not obviously wrong unless you read the
    spec very closely.  Hence, no back-patch.
    
    Discussion: https://postgr.es/m/5bb27a41-350d-37bf-901e-9d26f5592dd0@charter.net