    Improve to_date/to_number/to_timestamp behavior with multibyte characters. · 976a1a48
    Tom Lane authored
    The documentation says that these functions skip one input character
    per literal (non-pattern) format character.  Actually, though, they
    skipped one input *byte* per literal *byte*, which could be hugely
    confusing if either data or format contained multibyte characters.
    
    To fix, adjust the FormatNode representation and parse_format() so
    that multibyte format characters are stored as one FormatNode not
    several, and adjust the data-skipping bits to advance by pg_mblen()
    not necessarily one byte.  There's no user-visible behavior change
    on the to_char() side, although the internal representation changes.
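    
    As a rough illustration of the byte-versus-character distinction (not the
    actual formatting.c code), the sketch below contrasts the two skipping
    strategies.  It assumes UTF-8 input; utf8_char_len(), advance_one_char(),
    skip_literal_bytes() and skip_literal_chars() are hypothetical names, with
    utf8_char_len() standing in for pg_mblen().

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for pg_mblen(): byte length of the UTF-8 character
 * whose lead byte is at s.  Assumes well-formed UTF-8.
 */
static int
utf8_char_len(const char *s)
{
    unsigned char c = (unsigned char) *s;

    if (c < 0x80)
        return 1;
    if ((c & 0xE0) == 0xC0)
        return 2;
    if ((c & 0xF0) == 0xE0)
        return 3;
    if ((c & 0xF8) == 0xF0)
        return 4;
    return 1;                   /* invalid lead byte: treat as one byte */
}

/* Advance past one character, never stepping over the terminating NUL. */
static const char *
advance_one_char(const char *s)
{
    int         len = utf8_char_len(s);

    while (len-- > 0 && *s != '\0')
        s++;
    return s;
}

/* Byte-wise behavior: skip one input byte per literal format byte. */
static const char *
skip_literal_bytes(const char *input, size_t literal_bytes)
{
    while (literal_bytes-- > 0 && *input != '\0')
        input++;
    return input;
}

/* Character-wise behavior: skip one input character per literal character. */
static const char *
skip_literal_chars(const char *input, const char *literal)
{
    while (*literal != '\0' && *input != '\0')
    {
        literal = advance_one_char(literal);
        input = advance_one_char(input);
    }
    return input;
}
```

    With a three-byte literal such as U+5E74 in the format and the data
    "/2017", the byte-wise version consumes "/20" and leaves "17", while the
    character-wise version consumes only "/" and leaves "2017".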
    
    Commit e87d4965 had already fixed most places where we skip characters
    on the basis of non-literal format patterns to advance by characters
    not bytes, but this gets one more place, the SKIP_THth macro.  I think
    everything in formatting.c gets that right now.
    
    It'd be nice to have some regression test cases covering this behavior;
    but of course there's no way to do so in an encoding-agnostic way, and
    many of the interesting aspects would also require unportable locale
    selections.  So I've not bothered here.
    
    Discussion: https://postgr.es/m/28186.1510957703@sss.pgh.pa.us
formatting.c 137 KB