How to encode English plain text (consisting only of letters a-z and whitespace) using a 5-bit character encoding in Python?

Question

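Is there a way in Python to encode English plain text (consisting only of the lowercase letters a-z and whitespace, i.e. 27 characters in total) using a 5-bit character encoding? If so, please tell me how.

More specifically, say I have the string s="hello world". After encoding it in Python with a 5-bit character encoding, I want to save the string to an external file so that each character in that file takes up only 5 bits of storage space.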

Answer 1

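Probably the most widely recognized 5-bit encoding is Baudot (and its derivatives ITA2 and USTTY). Strictly speaking, it is a shift-based encoding with separate letters and figures shifts, but you can restrict your output to the letters shift.

Here is a simple encoding example (the encoding table is taken from http://code.google.com/p/tweletype/source/browse/baudot.py):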

# Baudot letters-shift table: the index of each character is its 5-bit code.
letters = "\x00E\x0AA SIU\x0DDRJNFCKTZLWHYPQOBG\x0EMXV\x0F"
s = "Hello World"
for c in s.upper():
    print(letters.find(c))
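
Decoding is the reverse lookup, since each 5-bit code is simply an index into the same table. The following is a minimal sketch rather than part of the original answer: the helper names `baudot_encode` and `baudot_decode` are made up for illustration, and only the letters shift is handled.

```python
# Letters-shift table from the answer; the index of a character is its 5-bit code.
LETTERS = "\x00E\x0AA SIU\x0DDRJNFCKTZLWHYPQOBG\x0EMXV\x0F"

def baudot_encode(text):
    """Return the list of 5-bit codes for text (hypothetical helper)."""
    codes = []
    for c in text.upper():
        code = LETTERS.find(c)
        if code < 0:
            raise ValueError("not in the letters shift: %r" % c)
        codes.append(code)
    return codes

def baudot_decode(codes):
    """Map 5-bit codes back to upper-case characters (hypothetical helper)."""
    return "".join(LETTERS[code] for code in codes)

codes = baudot_encode("Hello World")
print(codes)
print(baudot_decode(codes))
```

Note that the round trip loses letter case, because the table only contains upper-case letters.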

Answer 2

How about fewer than five bits? Testing with the first paragraph of the translated Lorem ipsum:

import gzip
text = 'But I must explain to you how all this mistaken idea of denouncing of a pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful. Nor again is there anyone who loves or pursues or desires to obtain pain of itself, because it is pain, but occasionally circumstances occur in which toil and pain can procure him some great pleasure. To take a trivial example, which of us ever undertakes laborious physical exercise, except to obtain some advantage from it? But who has any right to find fault with a man who chooses to enjoy a pleasure that has no annoying consequences, or one who avoids a pain that produces no resultant pleasure?'
text = ''.join(c for c in text.lower() if c.islower() or c == ' ')
encoded = gzip.compress(text.encode())
decoded = gzip.decompress(encoded).decode()
print('%.3f' % (len(encoded) / len(text) * 8), 'bits per char')
print('Roundtrip ok?', decoded == text)
print(len(set(text)), 'different chars in text')

Result:

4.135 bits per char
Roundtrip ok? True
26 different chars in text

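Compression like this exploits not only the fact that there are only 27 characters, but also their differing probabilities and the patterns among them.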
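
To put a number on that: a fixed-width code for 27 symbols needs 5 bits, equally likely symbols would need log2(27) bits on average, and uneven letter frequencies push the bound lower still. The sketch below computes the single-character (order-0) Shannon entropy; the function name `entropy_bits_per_char` and the sample string are illustrative assumptions, and the figure ignores the repeated patterns that gzip additionally exploits.

```python
import math
from collections import Counter

def entropy_bits_per_char(text):
    """Shannon entropy of the single-character frequencies, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

sample = "the quick brown fox jumps over the lazy dog"  # illustrative sample text
print("27 equally likely symbols:     %.3f bits" % math.log2(27))
print("order-0 entropy of the sample: %.3f bits" % entropy_bits_per_char(sample))
```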

I also tried lzma and bz2 instead of gzip, but for this particular example gzip compressed best.
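
A comparison along those lines could look like the sketch below. The helper `bits_per_char` and the placeholder sample are my own assumptions; substitute the filtered paragraph from above to reproduce the comparison.

```python
import bz2
import gzip
import lzma

def bits_per_char(compress, text):
    """Compressed size in bits divided by the number of input characters."""
    return len(compress(text.encode("ascii"))) * 8 / len(text)

# Placeholder sample; replace it with the filtered Lorem ipsum paragraph.
sample = "the quick brown fox jumps over the lazy dog " * 20

for name, compress in [("gzip", gzip.compress),
                       ("lzma", lzma.compress),
                       ("bz2", bz2.compress)]:
    print("%-4s %.3f bits per char" % (name, bits_per_char(compress, sample)))
```

For short inputs the fixed header overhead of each format dominates, so the ranking can change with the length and content of the text.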

Answer 3

First, you need to convert the characters from ASCII to a 5-bit encoding. How you do that is up to you. One possible straightforward way:

class TooMuchBits(Exception):
    pass

def encode_str(data):
    buf = bytearray()
    for char in data:
        num = ord(char)

        # Lower case latin letters
        if num >= 97 and num <= 122:
            buf.append(num - 96)

        # Space
        elif num == 32:
            buf.append(27)

        else:
            raise TooMuchBits(char)

    return buf

def decode_str(data):
    # Build a text string; appending str values to a bytearray fails on Python 3
    chars = []
    for num in data:
        if num == 27:
            chars.append(' ')
        else:
            chars.append(chr(num + 96))

    return ''.join(chars)

After that you have 5-bit numbers, which can be packed into 8-bit bytes. Like this:

# This should not be more than 8
BITS = 5

def get_last_bits(value, count):
    return value & ((1<<count) - 1)

def pack(data):
    buf = bytearray(1)
    used_bits = 0

    for num in data:
        # All zeroes is a special value marking unused bits
        if not isinstance(num, int) or num <= 0 or num.bit_length() > BITS:
            raise TooMuchBits(num)

        # Character fully fits into available bits in current byte
        if used_bits <= 8 - BITS:
            buf[-1] |= num << used_bits
            used_bits += BITS

        # Character should be split into two different bytes
        else:
            # Put lowest bits into available space
            buf[-1] |= get_last_bits(num, 8 - used_bits) << used_bits
            # Put highest bits into next byte
            buf.append(num >> (8 - used_bits))
            used_bits += BITS - 8

    return bytes(buf)

def unpack(data):
    buf = bytearray()
    data = bytearray(data)

    # Characters are filled with bitwise OR and therefore initialized with zero
    char_value = 0
    char_bits_left = BITS

    for byte in data:
        data_bits_left = 8

        while data_bits_left >= char_bits_left:
            # Current character ends in current byte
            # Take bits from current data bytes and shift them to appropriate position
            char_value |= get_last_bits(byte, char_bits_left) << (BITS - char_bits_left)

            # Discard processed bits
            byte = byte >> char_bits_left
            data_bits_left -= char_bits_left

            # Zero means the end of the string. It's necessary to detect unused space in the end of data
            # It's otherwise possible to detect such space as a 0x0 character
            if char_value == 0:
                break

            # Store and initialize character 
            buf.append(char_value)
            char_value = 0
            char_bits_left = BITS

        # Collect bits left in current byte
        if data_bits_left:
            char_value |= byte
            char_bits_left -= data_bits_left

    return buf
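
As a rough check on what packing saves, the packed size for non-empty input is the bit count rounded up to whole bytes. A small sketch of that arithmetic (the helper name `packed_size` is an assumption for illustration):

```python
BITS = 5  # bits per character, as in the answer

def packed_size(char_count, bits=BITS):
    """Number of whole bytes needed to hold char_count codes of `bits` bits each."""
    return (char_count * bits + 7) // 8

for n in (1, 8, 11, 43):
    print("%2d chars -> %2d bytes packed, %2d bytes as ASCII" % (n, packed_size(n), n))
```

Every 8 characters fit exactly into 5 bytes; other lengths leave up to 7 unused padding bits in the last byte.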

This appears to work as expected:

test_string = "the quick brown fox jumps over the lazy dog"

encoded = encode_str(test_string)
packed = pack(encoded)
unpacked = unpack(packed)
decoded = decode_str(unpacked)

print("Test str (len: %d): %r" % (len(test_string), test_string))
print("Encoded (len: %d):  %r" % (len(encoded), encoded))
print("Packed (len: %d):   %r" % (len(packed), packed))
print("Unpacked (len: %d): %r" % (len(unpacked), unpacked))
print("Decoded (len: %d):  %r" % (len(decoded), decoded))

Output:

Test str (len: 43): 'the quick brown fox jumps over the lazy dog'
Encoded (len: 43):  bytearray(b'\x14\x08\x05\x1b\x11\x15\t\x03\x0b\x1b\x02\x12\x0f\x17\x0e\x1b\x06\x0f\x18\x1b\n\x15\r\x10\x13\x1b\x0f\x16\x05\x12\x1b\x14\x08\x05\x1b\x0c\x01\x1a\x19\x1b\x04\x0f\x07')
Packed (len: 27):   b'\x14\x95\x1dk\x1ak\x0b\xf9\xae\xdb\xe6\xe1\xadj\x83s?[\xe4\xa6\xa8l\x16t\xde\xe4\x1d'
Unpacked (len: 43): bytearray(b'\x14\x08\x05\x1b\x11\x15\t\x03\x0b\x1b\x02\x12\x0f\x17\x0e\x1b\x06\x0f\x18\x1b\n\x15\r\x10\x13\x1b\x0f\x16\x05\x12\x1b\x14\x08\x05\x1b\x0c\x01\x1a\x19\x1b\x04\x0f\x07')
Decoded (len: 43):  'the quick brown fox jumps over the lazy dog'
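
To cover the "save to an external file" part of the question, the packed bytes only need to be written in binary mode. The sketch below is self-contained and uses a different, more compact technique than the code above: it accumulates all codes in one Python integer and converts it with `int.to_bytes`. The names `to_code`, `pack5`, `unpack5` and the file name are assumptions for illustration; the character mapping (a-z as 1..26, space as 27, 0 as padding) follows the answer.

```python
import os
import tempfile

BITS = 5

def to_code(char):
    """Same mapping as the answer: a-z -> 1..26, space -> 27, 0 reserved for padding."""
    if char == " ":
        return 27
    if "a" <= char <= "z":
        return ord(char) - 96
    raise ValueError("unsupported character: %r" % char)

def pack5(text):
    """Pack text into bytes, 5 bits per character, lowest bits first."""
    value = 0
    for i, char in enumerate(text):
        value |= to_code(char) << (BITS * i)
    return value.to_bytes((len(text) * BITS + 7) // 8, "little")

def unpack5(data):
    """Inverse of pack5; stops at the first all-zero code, which marks padding."""
    value = int.from_bytes(data, "little")
    chars = []
    for _ in range(len(data) * 8 // BITS):
        code = value & ((1 << BITS) - 1)
        if code == 0:
            break
        chars.append(" " if code == 27 else chr(code + 96))
        value >>= BITS
    return "".join(chars)

s = "hello world"
path = os.path.join(tempfile.gettempdir(), "hello.5bit")  # illustrative file name
with open(path, "wb") as f:  # binary mode: the packed bytes are written unchanged
    f.write(pack5(s))
with open(path, "rb") as f:
    restored = unpack5(f.read())

print(os.path.getsize(path), "bytes on disk for", len(s), "characters")
print(restored == s)
```

Files are stored in whole bytes, so the 11 characters of "hello world" take 7 bytes (56 bits) rather than exactly 55 bits.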

Tags: python, character-encoding