Why do Python and wc disagree on byte count?
Python and wc disagree severely on the byte count (length) of a given string:
with open("commedia.pfc", "w") as f:
t = ''.join(chr(int(b, base=2)) for b in chunks(compressed, 8))
print(len(t))
f.write(t)
Output: 318885
$> wc commedia.pfc
2181 12282 461491 commedia.pfc
The file consists mostly of unreadable characters, so I will provide a hex dump instead:
http://www.filedropper.com/dump_2
The file is the result of a prefix-free compression; if you ask, I can provide the full code that generated it, along with the input text.
Why are the two byte counts not equal?
I have added the full code of the compression algorithm. It looks long, but it is well documented and tested, so it should be easy to understand:
"""
Implementation of prefix-free compression and decompression.
"""
import doctest
from itertools import islice
from collections import Counter
import random
import json
def binary_strings(s):
"""
Given an initial list of binary strings `s`,
yield all binary strings ending in one of `s` strings.
>>> take(9, binary_strings(["010", "111"]))
['010', '111', '0010', '1010', '0111', '1111', '00010', '10010', '01010']
"""
yield from s
while True:
s = [b + x for x in s for b in "01"]
yield from s
def take(n, iterable):
"""
Return first n items of the iterable as a list.
"""
return list(islice(iterable, n))
def chunks(xs, n, pad='0'):
"""
Yield successive n-sized chunks from xs.
"""
for i in range(0, len(xs), n):
yield xs[i:i + n]
def reverse_dict(dictionary):
"""
>>> sorted(reverse_dict({1:"a",2:"b"}).items())
[('a', 1), ('b', 2)]
"""
return {value : key for key, value in dictionary.items()}
def prefix_free(generator):
"""
Given a `generator`, yield all the items from it
that do not start with any preceding element.
>>> take(6, prefix_free(binary_strings(["00", "01"])))
['00', '01', '100', '101', '1100', '1101']
"""
seen = []
for x in generator:
if not any(x.startswith(i) for i in seen):
yield x
seen.append(x)
def build_translation_dict(text, starting_binary_codes=["000", "100","111"]):
"""
Builds a dict for `prefix_free_compression` where
    more common characters map to shorter binary strings.
    This compresses because the shorter binary strings will be seen more times than
    the long ones.
Univocity in decoding is given by the binary_strings being prefix free.
>>> sorted(build_translation_dict("aaaaa bbbb ccc dd e", ["01", "11"]).items())
[(' ', '001'), ('a', '01'), ('b', '11'), ('c', '101'), ('d', '0001'), ('e', '1001')]
"""
binaries = sorted(list(take(len(set(text)), prefix_free(binary_strings(starting_binary_codes)))), key=len)
frequencies = Counter(text)
# char value tiebreaker to avoid non-determinism v
alphabet = sorted(list(set(text)), key=(lambda ch: (frequencies[ch], ch)), reverse=True)
return dict(zip(alphabet, binaries))
def prefix_free_compression(text, starting_binary_codes=["000", "100","111"]):
"""
Implements `prefix_free_compression`, simply uses the dict
made with `build_translation_dict`.
Returns a tuple (compressed_message, translation_dict) as the dict is needed
for decompression.
>>> prefix_free_compression("aaaaa bbbb ccc dd e", ["01", "11"])[0]
'010101010100111111111001101101101001000100010011001'
"""
translate = build_translation_dict(text, starting_binary_codes)
# print(translate)
return ''.join(translate[i] for i in text), translate
def prefix_free_decompression(compressed, translation_dict):
"""
Decompresses a prefix free `compressed` message in the form of a string
composed only of '0' and '1'.
Being the binary codes prefix free,
the decompression is allowed to take the earliest match it finds.
>>> message, d = prefix_free_compression("aaaaa bbbb ccc dd e", ["01", "11"])
>>> message
'010101010100111111111001101101101001000100010011001'
>>> sorted(d.items())
[(' ', '001'), ('a', '01'), ('b', '11'), ('c', '101'), ('d', '0001'), ('e', '1001')]
>>> ''.join(prefix_free_decompression(message, d))
'aaaaa bbbb ccc dd e'
"""
decoding_translate = reverse_dict(translation_dict)
# print(decoding_translate)
word = ''
for bit in compressed:
# print(word, "-", bit)
if word in decoding_translate:
yield decoding_translate[word]
word = ''
word += bit
yield decoding_translate[word]
if __name__ == "__main__":
doctest.testmod()
with open("commedia.txt") as f:
text = f.read()
compressed, d = prefix_free_compression(text)
with open("commedia.pfc", "w") as f:
t = ''.join(chr(int(b, base=2)) for b in chunks(compressed, 8))
print(len(t))
f.write(t)
with open("commedia.pfcd", "w") as f:
f.write(json.dumps(d))
# dividing by 8 goes from bit length to byte length
print("Compressed / uncompressed ratio is {}".format((len(compressed)//8) / len(text)))
original = ''.join(prefix_free_decompression(compressed, d))
assert original == text
commedia.txt is at filedropper.com/commedia
You are using Python 3 and a str object - that means the count you see from len(t) is in characters. Now, characters are not bytes - and it has been so since the 90's.
Since you did not declare an explicit text encoding, the file write encodes your text using the system default encoding - which on Linux or Mac OS X will be utf-8 - an encoding in which any character outside the ASCII range (ord(ch) > 127) uses more than one byte on disk.
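A minimal sketch of that mismatch (not part of the question's code): any code point above 127 occupies more than one byte once encoded as UTF-8, so the character count and the on-disk byte count diverge.

```python
# Illustrative only: a str counts characters, its encoded form counts bytes.
s = chr(200)                    # 'È' - one character, outside ASCII
print(len(s))                   # 1 (characters)
print(len(s.encode("utf-8")))   # 2 (bytes on disk under UTF-8)
```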
So, your program is basically wrong. First, decide whether you are dealing with text or bytes. If you are dealing with bytes, open the file for writing in binary mode (wb, not w) and change this line:
t = ''.join(chr(int(b, base=2)) for b in chunks(compressed, 8))
to
t = bytes(int(b, base=2) for b in chunks(compressed, 8))
That way it is clear that you are working with the bytes themselves, rather than mixing up characters and bytes.
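Put together, a sketch of the corrected write path (using a stand-in bit string; `chunks` is reproduced from the question, minus its unused `pad` parameter):

```python
def chunks(xs, n):
    """Yield successive n-sized chunks from xs."""
    for i in range(0, len(xs), n):
        yield xs[i:i + n]

compressed = "0101010101001111"   # stand-in for the real bit string
t = bytes(int(b, base=2) for b in chunks(compressed, 8))
with open("commedia.pfc", "wb") as f:   # binary mode: no encoding step
    f.write(t)
print(len(t))                     # 2 - len() now counts real bytes
```

With binary mode there is no implicit encoding, so len(t), the file size, and wc's byte count all agree.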
Of course, there is an ugly workaround to "transparently encode" the text straight to a bytes object - (that is, if your original list contains all character code points in the 0-255 range): you could encode your previous t with the latin1 encoding before writing it to the file. But that is semantically wrong.
You can also experiment with Python's little-known "bytearray" object: it gives one the ability to deal with elements that are 8-bit numbers, with the convenience of being mutable and extensible (just like a C "string" that had enough memory space pre-allocated).
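A small sketch of that suggestion - unlike bytes, a bytearray is mutable, so 8-bit values can be appended as they are produced:

```python
buf = bytearray()
for b in ("01010101", "01001111"):   # bit-string chunks, as in the question
    buf.append(int(b, base=2))       # append one 8-bit integer at a time
print(bytes(buf))                    # b'UO'
```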
@jsbueno is right. Besides, if you open the resulting file in binary read mode, you get the expected count:
>>> with open('commedia.pfc', 'rb') as f:
...     print(len(f.read()))
461491