I Made a Character Encoding for Mainland China


    I've been taking a deep dive into character encoding recently. I noticed that China has its own encoding, GB 18030, which is backwards compatible with the country's legacy encodings while also being capable of encoding any Unicode code point. Like UTF-8, it is ASCII-compatible and doesn't have implicit null bytes. And like UTF-16, it can encode most Simplified Chinese text with just two bytes per character.
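
These properties can be checked quickly. The sketch below uses Python and its built-in `gb18030` codec, not the .NET implementation this post is about. The sample characters are arbitrary picks for illustration.

```python
# Sanity checks against Python's built-in "gb18030" codec.
ascii_bytes = "A".encode("gb18030")    # ASCII stays a single byte
common_hanzi = "中".encode("gb18030")  # common Simplified Chinese: two bytes
emoji = "😀".encode("gb18030")         # outside the legacy range: four bytes

print(len(ascii_bytes), len(common_hanzi), len(emoji))

# Backwards compatibility: the GBK bytes for this character are unchanged.
assert "中".encode("gbk") == common_hanzi

# No null bytes hide inside the multi-byte sequences.
assert 0 not in common_hanzi and 0 not in emoji

# Code points from different parts of Unicode all round-trip.
for cp in (0x41, 0x4E2D, 0x20AC, 0x1F600, 0x10FFFF):
    ch = chr(cp)
    assert ch.encode("gb18030").decode("gb18030") == ch
```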

    GB 18030 has two massive drawbacks, though. I discovered these while writing my own implementation for .NET 5:

    1. Decoding each character involves a mix of checking lookup tables and/or iterating through ranges.
    2. There is no self-synchronization for Unicode values.

    Specifically, GB 18030 has several possible byte patterns. Instead of matching on bits (as UTF-8 and UTF-16 do), it makes decisions based on the range a byte falls in rather than on whether certain bits are set:

    ASCII mode:
    1 byte
    0x00-0x7F

    GBK mode:
    2 bytes
    0x81-0xFE | 0x40-0xFE (excluding 0x7F)

    Unicode mode:
    4 bytes
    0x81-0xFE | 0x30-0x39 | 0x81-0xFE | 0x30-0x39
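
As a rough illustration of the range-based dispatch, here is a minimal Python sketch that classifies a sequence from the byte ranges listed above. The function name `sequence_length` is made up for this example and is not taken from the .NET implementation.

```python
def sequence_length(data: bytes, i: int = 0) -> int:
    """Return 1, 2, or 4: the length of the GB 18030 sequence starting at data[i]."""
    lead = data[i]
    if lead <= 0x7F:
        return 1  # ASCII mode
    if not 0x81 <= lead <= 0xFE:
        raise ValueError(f"invalid lead byte 0x{lead:02X}")
    if i + 1 >= len(data):
        raise ValueError("truncated sequence")

    second = data[i + 1]
    if 0x30 <= second <= 0x39:
        # Unicode mode: the third and fourth bytes must fit their ranges too.
        if i + 3 >= len(data):
            raise ValueError("truncated four-byte sequence")
        third, fourth = data[i + 2], data[i + 3]
        if 0x81 <= third <= 0xFE and 0x30 <= fourth <= 0x39:
            return 4
        raise ValueError("invalid four-byte sequence")
    if 0x40 <= second <= 0xFE and second != 0x7F:
        return 2  # GBK mode (0x7F is not a valid second byte)
    raise ValueError(f"invalid second byte 0x{second:02X}")


def split_sequences(data: bytes) -> list:
    """Split an encoded buffer into its individual character sequences."""
    out, i = [], 0
    while i < len(data):
        n = sequence_length(data, i)
        out.append(data[i:i + n])
        i += n
    return out
```

Note that the decision needs the second byte as well as the first: a lead byte in 0x81-0xFE alone does not say whether two or four bytes follow.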

    The ASCII mode up there makes sense. So does GBK mode, which also happens to be compatible with GB 2312 from 1980. But the Unicode mode is a bit hard to deal with. For one thing, we need to consult a range table to work out how to map algorithmically between byte sequences and code point values.
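
To give a feel for the arithmetic, here is a hedged Python sketch of the four-byte mapping for the supplementary planes only (U+10000 to U+10FFFF), which form one contiguous range starting at bytes 90 30 81 30. BMP code points additionally need the range table from the standard, which is not reproduced here. The helper names are invented for this example.

```python
# Linear index of the byte sequence 90 30 81 30, which encodes U+10000.
LINEAR_U10000 = (0x90 - 0x81) * 10 * 126 * 10  # 189000


def four_byte_to_linear(b: bytes) -> int:
    """Treat the four bytes as mixed-radix digits (126, 10, 126, 10)."""
    return (((b[0] - 0x81) * 10 + (b[1] - 0x30)) * 126 + (b[2] - 0x81)) * 10 + (b[3] - 0x30)


def linear_to_four_byte(n: int) -> bytes:
    """Inverse of four_byte_to_linear."""
    n, d4 = divmod(n, 10)
    n, d3 = divmod(n, 126)
    d1, d2 = divmod(n, 10)
    return bytes((0x81 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4))


def encode_supplementary(cp: int) -> bytes:
    """Encode a supplementary-plane code point; BMP values are out of scope here."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF is handled in this sketch")
    return linear_to_four_byte(LINEAR_U10000 + (cp - 0x10000))


def decode_supplementary(b: bytes) -> int:
    """Decode a four-byte sequence that falls in the supplementary range."""
    cp = four_byte_to_linear(b) - LINEAR_U10000 + 0x10000
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("sequence is outside the supplementary range")
    return cp


# Cross-check against Python's built-in codec.
for cp in (0x10000, 0x1F600, 0x10FFFF):
    assert encode_supplementary(cp) == chr(cp).encode("gb18030")
```

Even in this simplest range, every byte costs a subtraction plus a multiply or divide, and the decoder still has to establish which range it is in first.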

    So let's try this for Unicode instead:

    4 bytes
    0b1xxxxxx0 | 0b001xxxxx | 0b001xxxxx | 0b001xxxxx

    There. That scheme carries 21 bits of payload, so it addresses 2^21 values, and we just put the code point value in there, plain and simple. This gives us complete Unicode coverage. We don't have to iterate through any ranges anymore, and there's only one lookup table now (for the GBK range). When benchmarking round-tripping across every Unicode code point (sans the surrogate range), this new encoding was 4x as fast as my GB 18030 implementation. It's also not mistakable for a GBK sequence, and it self-synchronizes.
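
A minimal Python sketch of how that four-byte layout could be packed and unpacked. Two things here are assumptions, since the post does not spell them out: the bit order (most significant bits go in the first byte) and the names `gbx_encode4` / `gbx_decode4`. It also puts any code point into the four-byte form, whereas the real GBX presumably keeps the ASCII and GBK modes for the characters they cover.

```python
def gbx_encode4(cp: int) -> bytes:
    """Pack a code point into 0b1xxxxxx0 | 0b001xxxxx | 0b001xxxxx | 0b001xxxxx."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    return bytes((
        0x80 | (((cp >> 15) & 0x3F) << 1),  # top 6 bits, low bit left at 0
        0x20 | ((cp >> 10) & 0x1F),         # next 5 bits
        0x20 | ((cp >> 5) & 0x1F),          # next 5 bits
        0x20 | (cp & 0x1F),                 # lowest 5 bits
    ))


def gbx_decode4(b: bytes) -> int:
    """Unpack a four-byte sequence, validating the fixed marker bits."""
    if len(b) != 4:
        raise ValueError("expected exactly four bytes")
    if (b[0] & 0x81) != 0x80:
        raise ValueError("bad lead byte")
    if any((x & 0xE0) != 0x20 for x in b[1:]):
        raise ValueError("bad trail byte")
    return (
        (((b[0] >> 1) & 0x3F) << 15)
        | ((b[1] & 0x1F) << 10)
        | ((b[2] & 0x1F) << 5)
        | (b[3] & 0x1F)
    )


# Round-trip a sample of code points, skipping the surrogate range.
for cp in range(0, 0x110000, 997):
    if 0xD800 <= cp <= 0xDFFF:
        continue
    assert gbx_decode4(gbx_encode4(cp)) == cp
```

Encoding and decoding are pure shifts and masks with no table or range search. The second byte always lands in 0x20-0x3F, below the 0x40 floor of a GBK second byte, which is why the pair cannot be read as a GBK sequence.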

    I'm currently calling it GBX.

    GB 18030 implementation for .NET 5:

    GBX implementation for .NET 5:

    EDIT: No, this won't amount to anything. It was quite fun to make, though.