Mapping of character encodings to maximum bytes per character

Question

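I'm looking for a table that maps a given character encoding to the number of bytes per character (the maximum, in the case of variable-length encodings). For fixed-width encodings this is easy enough, but I don't know what the width is for some of the more esoteric encodings. For UTF-8 and the like, it would also be nice to determine the maximum bytes per character from the highest code point in a string, but that is less pressing.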
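For the UTF-8 case, the width of a character follows directly from its code point, so the maximum for a given string can be derived from its highest code point. The following is a minimal sketch of that idea; the function names are illustrative, and it assumes the string contains no lone surrogates (which UTF-8 cannot encode).

```python
def utf8_bytes_for_codepoint(codepoint):
    """Number of bytes UTF-8 uses for a single code point."""
    if codepoint < 0x80:
        return 1
    if codepoint < 0x800:
        return 2
    if codepoint < 0x10000:
        return 3
    return 4


def utf8_max_bytes_per_char(text):
    """Widest UTF-8 character in `text`, derived from its highest code point."""
    if not text:
        return 0
    return utf8_bytes_for_codepoint(max(map(ord, text)))
```

This works because UTF-8 widths never decrease as the code point grows, so the highest code point is also the widest character. That property does not hold for every encoding.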

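For some background (which you can ignore if you're not familiar with NumPy): I'm working on a prototype of an ndarray subclass that can, with some degree of transparency, represent arrays of encoded bytes (including plain ASCII) as arrays of unicode strings, without actually converting the entire array to UCS4 at once. The idea is that the underlying dtype is still an S<N> dtype, where <N> is the (maximum) number of bytes per string in the array, but item lookups and string methods decode the strings on the fly using the correct encoding. A very rough prototype can be seen here, though eventually parts of this will likely be implemented in C. The most important thing for my use case is efficient use of memory; repeated decoding and re-encoding of strings is acceptable overhead.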
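To make the idea concrete, here is a rough sketch of storing encoded bytes in an S<N> array and decoding only on item lookup. This is not the prototype mentioned above: `EncodedArray` is a hypothetical plain wrapper class (not an ndarray subclass), and it only handles scalar indexing.

```python
import numpy as np


class EncodedArray:
    """Hypothetical wrapper: stores encoded bytes, decodes on item lookup."""

    def __init__(self, strings, encoding):
        self.encoding = encoding
        # NumPy infers an S<N> dtype, where N is the longest encoded string.
        self.data = np.array([s.encode(encoding) for s in strings])

    def __getitem__(self, index):
        # Decode a single item on the fly; the array itself stays as bytes.
        return self.data[index].decode(self.encoding)


arr = EncodedArray(["héllo", "日本"], "utf-8")
print(arr.data.dtype)  # S6: both strings take 6 bytes in UTF-8
print(arr[1])
```

One caveat with this approach: NumPy strips trailing NUL bytes from S<N> items, so encodings whose output can end in a zero byte (UTF-16 and UTF-32, for example) would need extra care.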

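In any case, because the underlying dtype is expressed in bytes, it tells users nothing useful about the lengths of strings that can be written to a text array in a given encoding. So, if nothing else, having such a mapping for arbitrary encodings would be very useful for improving the user interface.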
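As an illustration of how such a mapping could feed the user interface, the sketch below turns a byte width into a guaranteed character capacity. The table is a small hand-written assumption covering a few well-known encodings, not the comprehensive database being asked for, and `guaranteed_chars` is a hypothetical helper name.

```python
# Illustrative table only (an assumption, far from exhaustive). BOM-free
# variants are listed so that no byte-order mark counts against the budget.
MAX_BYTES_PER_CHAR = {
    "ascii": 1,
    "latin-1": 1,
    "utf-8": 4,
    "utf-16-le": 4,
    "utf-32-le": 4,
}


def guaranteed_chars(itemsize, encoding):
    """Characters that always fit in `itemsize` bytes, whatever they are."""
    return itemsize // MAX_BYTES_PER_CHAR[encoding]


print(guaranteed_chars(10, "ascii"))  # 10
print(guaranteed_chars(10, "utf-8"))  # 2
```

This is a worst-case figure: an S10 array holding UTF-8 text can store ten ASCII characters, but only two characters are guaranteed to fit regardless of content.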

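Note: I found an answer to essentially the same question, but specific to Java, here: How can I programatically determine the maximum size in bytes of a character in a specific charset? However, I haven't been able to find any equivalent in Python, nor a useful database of information from which I could implement my own.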

Answer 1

The brute-force approach: iterate over every possible Unicode character and keep track of the greatest number of bytes used.

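def max_bytes_per_char(encoding):
    """Return the largest encoded size, in bytes, of any single code point."""
    max_bytes = 0
    for codepoint in range(0x110000):  # every Unicode code point
        try:
            encoded = chr(codepoint).encode(encoding)
            max_bytes = max(max_bytes, len(encoded))
        except UnicodeError:
            # Code point not representable in this encoding (e.g. surrogates).
            pass
    return max_bytes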


>>> max_bytes_per_char('UTF-8')
4
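One caveat with the brute-force approach: codecs that prepend a byte-order mark, such as 'utf-16', 'utf-32' and 'utf-8-sig', add the BOM to every single-character string, so the reported maximum includes it (6 rather than 4 for 'utf-16'). A possible variant, sketched here under the assumption that encoding an empty string yields exactly that fixed prefix, subtracts it; the function name is illustrative.

```python
def max_payload_bytes_per_char(encoding):
    """Brute-force maximum bytes per character, excluding any fixed prefix."""
    # Bytes emitted for an empty string: the BOM for codecs such as
    # 'utf-16', 'utf-32' and 'utf-8-sig', and nothing for most others.
    prefix = len("".encode(encoding))
    max_bytes = 0
    for codepoint in range(0x110000):
        try:
            nbytes = len(chr(codepoint).encode(encoding)) - prefix
        except UnicodeError:
            continue  # not representable in this encoding
        max_bytes = max(max_bytes, nbytes)
    return max_bytes
```

Alternatively, the explicit-endianness names ('utf-16-le', 'utf-32-be', and so on) never emit a BOM, so the original function can be used with them unchanged.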

Answer 2

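Although I've accepted @dan04's answer, I'm also adding my own here. It was inspired by @dan04's but goes a bit further: it gives the encoded width of every character supported by a given encoding, along with the ranges of characters that encode to that width (where a width of 0 means the character is unsupported):

from collections import defaultdict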

def encoding_ranges(encoding):
    codepoint_ranges = defaultdict(list)
    cur_nbytes = None
    start = 0
    for codepoint in range(0x110000):
        try:
            encoded = chr(codepoint).encode(encoding)
            nbytes = len(encoded)
        except UnicodeError:
            nbytes = 0

        if nbytes != cur_nbytes and cur_nbytes is not None:
            if codepoint - start > 2:
                codepoint_ranges[cur_nbytes].append((start, codepoint))
            else:
                codepoint_ranges[cur_nbytes].extend(range(start, codepoint))

            start = codepoint

        cur_nbytes = nbytes

    codepoint_ranges[cur_nbytes].append((start, codepoint + 1))
    return codepoint_ranges

For example:

>>> encoding_ranges('ascii')
defaultdict(<class 'list'>, {0: [(128, 1114112)], 1: [(0, 128)]})
>>> encoding_ranges('utf8')
defaultdict(<class 'list'>, {0: [(55296, 57344)], 1: [(0, 128)], 2: [(128, 2048)], 3: [(2048, 55296), (57344, 65536)], 4: [(65536, 1114112)]})
>>> encoding_ranges('shift_jis')

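For ranges of 2 or fewer characters, it records the code points themselves rather than a range, which is more useful for the more awkward encodings such as shift_jis.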
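Because the result mixes (start, stop) tuples with bare code points, looking up the width of a particular character needs to handle both kinds of entry. A minimal sketch of such a lookup follows; `width_of` is a hypothetical helper, and the mapping is copied from the 'utf8' output shown above.

```python
def width_of(codepoint, ranges):
    """Find the encoded width of `codepoint` in an encoding_ranges() result."""
    for nbytes, entries in ranges.items():
        for entry in entries:
            if isinstance(entry, tuple):
                start, stop = entry  # half-open range [start, stop)
                if start <= codepoint < stop:
                    return nbytes
            elif entry == codepoint:  # short runs are stored as bare ints
                return nbytes
    raise ValueError("code point not covered: %r" % codepoint)


# Mapping taken from the encoding_ranges('utf8') output above.
UTF8_RANGES = {
    0: [(55296, 57344)],
    1: [(0, 128)],
    2: [(128, 2048)],
    3: [(2048, 55296), (57344, 65536)],
    4: [(65536, 1114112)],
}

print(width_of(ord("A"), UTF8_RANGES))  # 1
print(width_of(0x20AC, UTF8_RANGES))    # 3
```

A linear scan is fine for a handful of ranges; for encodings with many small ranges, a sorted list searched with the bisect module would likely be a better fit.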

python

numpy

character-encoding