    Clean up formatting.c's logic for matching constant strings. · 4c70098f
    Tom Lane authored
    seq_search(), which is used to match input substrings to constants
    such as month and day names, had a lot of bizarre and unnecessary
    behaviors.  It was mostly possible to avert our eyes from that before,
    but we don't want to duplicate those behaviors in the upcoming patch
    to allow recognition of non-English month and day names.  So it's time
    to clean this up.  In particular:
    
    * seq_search scribbled on the input string, which is a pretty dangerous
    thing to do, especially in the badly underdocumented way it was done here.
    Fortunately the input string is a temporary copy, but that was being made
    three subroutine levels away, making it something easy to break
    accidentally.  The behavior is externally visible nonetheless, in the form
    of odd case-folding in error reports about unrecognized month/day names.
    The scribbling is evidently being done to save a few calls to pg_tolower,
    but that's such a cheap function (at least for ASCII data) that it's
    pretty pointless to worry about.  In HEAD I switched it to be
    pg_ascii_tolower to ensure it is cheap in all cases; but there are corner
    cases in Turkish where this'd change behavior, so leave it as pg_tolower
    in the back branches.
    
    * seq_search insisted on knowing the case form (all-upper, all-lower,
    or initcap) of the constant strings, so that it didn't have to case-fold
    them to perform case-insensitive comparisons.  This likewise seems like
    excessive micro-optimization, given that pg_tolower is certainly very
    cheap for ASCII data.  It seems unsafe to assume that we know the case
    form that will come out of pg_locale.c for localized month/day names, so
    it's better just to define the comparison rule as "downcase all strings
    before comparing".  (The choice between downcasing and upcasing is
    arbitrary so far as English is concerned, but it might not be in other
    locales, so follow citext's lead here.)
    
    * seq_search also had a parameter that'd cause it to report a match
    after a maximum number of characters, even if the constant string were
    longer than that.  This was not actually used because no caller passed
    a value small enough to cut off a comparison.  Replicating that behavior
    for localized month/day names seems expensive as well as useless, so
    let's get rid of that too.
    
    * from_char_seq_search used the maximum-length parameter to truncate
    the input string in error reports about not finding a matching name.
    This leads to rather confusing reports in many cases.  Worse, it is
    outright dangerous if the input string isn't all-ASCII, because we
    risk truncating the string in the middle of a multibyte character.
    That'd lead either to delivering an illegible error message to the
    client, or to encoding-conversion failures that obscure the actual
    data problem.  Get rid of that in favor of truncating at whitespace
    if any (a suggestion due to Alvaro Herrera).
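
As an illustration of the last point, here is a minimal sketch of cutting the reported text at the first whitespace byte rather than at a fixed byte count. The name copy_up_to_whitespace and the use of malloc are assumptions made for this example, not the code in formatting.c. The sketch also assumes an encoding such as UTF-8, where ASCII whitespace bytes never occur inside a multibyte character, so the cut cannot split one.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* ASCII-only whitespace test, so the result does not depend on the locale. */
static int
is_ascii_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}

/*
 * Hypothetical helper (illustration only): return a malloc'd copy of the
 * leading part of "input", up to but not including the first whitespace
 * byte.  The input is never modified.  Returns NULL on allocation failure;
 * the caller frees the result.
 */
static char *
copy_up_to_whitespace(const char *input)
{
    size_t len = 0;
    char  *copy;

    assert(input != NULL);

    /* Stop at the first whitespace byte or at the end of the string. */
    while (input[len] != '\0' && !is_ascii_space(input[len]))
        len++;

    copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;

    memcpy(copy, input, len);
    copy[len] = '\0';
    return copy;
}
```

For an input such as "Febtember 2020", an error report built this way would quote only "Febtember", whatever the length of the longest valid name.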
    
    In addition to fixing these things, I const-ified the input string
    pointers of DCH_from_char and its subroutines, to make sure there
    aren't any other scribbling-on-input problems.
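
The comparison rule described above ("downcase all strings before comparing", with a const input that is never written to) might look roughly like the following. This is a simplified sketch rather than the committed function: seq_search_sketch and ascii_tolower_sketch are names invented here, the latter standing in for pg_ascii_tolower, and the signature is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* ASCII-only downcasing; a stand-in for pg_ascii_tolower. */
static unsigned char
ascii_tolower_sketch(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}

/*
 * Hypothetical sketch: search the NULL-terminated "array" for the first
 * entry that matches a prefix of "name", ignoring ASCII case.  On success,
 * return the entry's index and store the matched length in *len; otherwise
 * return -1.  Neither the input nor the constants are modified, and no
 * assumption is made about the case form of the constants.
 */
static int
seq_search_sketch(const char *name, const char *const *array, size_t *len)
{
    const char *const *a;

    assert(name != NULL && array != NULL && len != NULL);
    *len = 0;

    for (a = array; *a != NULL; a++)
    {
        const char *p = *a;     /* walks the constant string */
        const char *n = name;   /* walks the input */

        /* Downcase both sides on the fly instead of rewriting the input. */
        while (*p != '\0' &&
               ascii_tolower_sketch((unsigned char) *p) ==
               ascii_tolower_sketch((unsigned char) *n))
        {
            p++;
            n++;
        }

        /* Reaching the end of the constant means the whole name matched. */
        if (*p == '\0')
        {
            *len = (size_t) (p - *a);
            return (int) (a - array);
        }
    }

    return -1;
}
```

Because the first matching entry wins, a caller of this sketch would need to order the array so that no entry is a prefix of a later one, or list longer names first.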
    
    The risk of generating a badly-encoded error message seems like
    enough of a bug to justify back-patching, so patch all supported
    branches.
    
    Discussion: https://postgr.es/m/29432.1579731087@sss.pgh.pa.us