    Extend GB18030 encoding conversion to cover full Unicode range. · 8d3e0906
    Tom Lane authored
    Our previous code for GB18030 <-> UTF8 conversion only covered Unicode code
    points up to U+FFFF, but the actual spec defines conversions for all code
    points up to U+10FFFF.  That would be rather impractical as a lookup table,
    but fortunately there is a simple algorithmic conversion between the
    additional code points and the equivalent GB18030 byte patterns.  Make use
    of the just-added callback facility in LocalToUtf/UtfToLocal to perform the
    additional conversions.
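
    As a rough illustration of the algorithmic conversion mentioned above,
    here is a minimal sketch rather than the code from this commit.  It
    assumes the usual GB18030 four-byte layout (first and third bytes in
    0x81..0xFE, second and fourth bytes in 0x30..0x39) and that U+10000
    corresponds to the sequence 90 30 81 30.  The function names are
    hypothetical, and the four bytes are packed big-endian into one 32-bit
    word purely for convenience.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a code point in U+10000..U+10FFFF to its four-byte GB18030 sequence,
 * returned as a big-endian packed 32-bit word (hypothetical helper).
 */
static uint32_t
supplementary_to_gb18030(uint32_t cp)
{
    uint32_t idx, b1, b2, b3, b4;

    assert(cp >= 0x10000 && cp <= 0x10FFFF);

    /* Offset from the first supplementary code point. */
    idx = cp - 0x10000;

    /* Peel off the bytes from least to most significant. */
    b4 = 0x30 + idx % 10;
    idx /= 10;
    b3 = 0x81 + idx % 126;
    idx /= 126;
    b2 = 0x30 + idx % 10;
    idx /= 10;
    b1 = 0x90 + idx;            /* assumed start of the supplementary block */

    return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
}

/*
 * Inverse mapping: packed four-byte GB18030 sequence back to a code point.
 * The caller is assumed to pass a well-formed sequence in the
 * supplementary block; no validation is done here.
 */
static uint32_t
gb18030_to_supplementary(uint32_t code)
{
    uint32_t b1 = (code >> 24) & 0xFF;
    uint32_t b2 = (code >> 16) & 0xFF;
    uint32_t b3 = (code >> 8) & 0xFF;
    uint32_t b4 = code & 0xFF;
    uint32_t idx;

    idx = (((b1 - 0x90) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10
        + (b4 - 0x30);

    return idx + 0x10000;
}
```

    With these definitions, U+10000 maps to 0x90308130 and U+10FFFF to
    0xE3329A35, so the whole supplementary range needs no table entries.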
    
    Having created the infrastructure to do that, we can use the same code to
    map certain linearly-related subranges of the Unicode space below U+FFFF,
    allowing removal of the corresponding lookup table entries.  This more
    than halves the lookup table size, which is a substantial savings;
    utf8_and_gb18030.so drops from nearly a megabyte to about half that.
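
    The same idea applied to one subrange below U+FFFF might look like the
    sketch below.  It is only an illustration under stated assumptions: the
    range endpoints (U+0452..U+200F <-> 81 30 D3 30 .. 81 36 A5 31) are
    quoted from memory of the ICU mapping data and should be checked against
    gb-18030-2000.xml, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a packed four-byte GB18030 code to a linear index, so that
 * consecutive codes yield consecutive integers (hypothetical helper).
 */
static uint32_t
gb_linear(uint32_t code)
{
    uint32_t b1 = (code >> 24) & 0xFF;
    uint32_t b2 = (code >> 16) & 0xFF;
    uint32_t b3 = (code >> 8) & 0xFF;
    uint32_t b4 = code & 0xFF;

    /* Each byte must lie in its permitted four-byte range. */
    assert(b1 >= 0x81 && b1 <= 0xFE && b3 >= 0x81 && b3 <= 0xFE);
    assert(b2 >= 0x30 && b2 <= 0x39 && b4 >= 0x30 && b4 <= 0x39);

    return (b1 - 0x81) * 12600 + (b2 - 0x30) * 1260
        + (b3 - 0x81) * 10 + (b4 - 0x30);
}

/*
 * Map a four-byte code inside one linearly-related subrange to Unicode.
 * Returns 0 when the code lies outside the subrange, in which case a real
 * converter would fall back to its lookup table.
 */
static uint32_t
gb18030_subrange_to_unicode(uint32_t code)
{
    /* Assumed endpoints of a single subrange; verify before relying on them. */
    const uint32_t min_unicode = 0x0452;
    const uint32_t min_code = 0x8130D330;
    const uint32_t max_code = 0x8136A531;

    if (code < min_code || code > max_code)
        return 0;

    return gb_linear(code) - gb_linear(min_code) + min_unicode;
}
```

    Every code in such a subrange is covered by three constants instead of
    one table entry per code point, which is where the size reduction comes
    from.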
    
    In support of doing that, replace ISO10646-GB18030.TXT with the data file
    gb-18030-2000.xml (retrieved from
    http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/ )
    in which these subranges have been deleted from the simple lookup entries.
    
    Per bug #12845 from Arjen Nienhuis.  The conversion code added here is
    based on his proposed patch, though I whacked it around rather heavily.
gb-18030-2000.xml 826 KB