Python: Converting Binary Literal text file to Normal Text

Question

I have a text file in this format:

b'Chapter 1 \xe2\x80\x93 BlaBla'
b'Boy\xe2\x80\x99s Dead.'

I want to read these lines and convert them to

Chapter 1 - BlaBla
Boy's Dead.

and replace them on the same file. I have tried encoding and decoding with print(line.encode("UTF-8", "replace")), but that didn't work.

Answer 1

strings = [
    b'Chapter 1 \xe2\x80\x93 BlaBla',
    b'Boy\xe2\x80\x99s Dead.',
]

for string in strings:
    print(string.decode('utf-8', 'ignore'))

--output:--
Chapter 1 – BlaBla
Boy’s Dead.

and replace them on the same file.

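No programming language can insert or delete text in the middle of a file in place. In general you have to write the output to a new file, delete the old file, and rename the new file to the old name. Python's fileinput module, however, can carry out that process for you: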

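import fileinput as fi

# Create a sample file containing the raw UTF-8 bytes.
with open('data.txt', 'wb') as f:
    f.write(b'Chapter 1 \xe2\x80\x93 BlaBla\n')
    f.write(b'Boy\xe2\x80\x99s Dead.\n')

# Show the raw bytes of each line in the terminal window.
with open('data.txt', 'rb') as f:
    for line in f:
        print(line)

# With inplace=True, print() writes to data.txt instead of the terminal;
# the original file is kept as data.txt.bak.
with fi.input(
        files='data.txt',
        inplace=True,
        backup='.bak',
        mode='rb') as f:

    for line in f:
        string = line.decode('utf-8', 'ignore')
        print(string, end="")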

~/python_programs$ python3.4 prog.py
b'Chapter 1 \xe2\x80\x93 BlaBla\n'
b'Boy\xe2\x80\x99s Dead.\n'

~/python_programs$ cat data.txt
Chapter 1 – BlaBla
Boy’s Dead.
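
For comparison, the manual process described earlier (write to a new file, then rename it over the old one) can be sketched roughly as follows. The helper name rewrite_in_place and its transform parameter are made up for illustration, and the sketch assumes the file is UTF-8 text; it is not what fileinput does internally.

```python
import os
import tempfile


def rewrite_in_place(path, transform):
    """Replace the contents of `path` with transform(line) for each line.

    Hypothetical helper: assumes `path` is a UTF-8 text file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Put the temporary file next to the original so the final rename
    # does not cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'w', encoding='utf-8', newline='') as out, \
                open(path, encoding='utf-8', newline='') as src:
            for line in src:
                out.write(transform(line))
        os.replace(tmp_path, path)  # the new file takes the old file's name
    except BaseException:
        os.unlink(tmp_path)  # clean up the temporary file on failure
        raise
```

A call such as rewrite_in_place('data.txt', str.upper) would then rewrite every line in upper case without leaving a backup file behind.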

Edit:

import fileinput as fi
import re

pattern = r"""
    \\              #Match a literal backslash...
    x               #Followed by an x...
    [a-f0-9]{2}     #Followed by any hex character, 2 times
"""

repl = ''

with open('data.txt', 'w') as f:
    print(r"b'Chapter 1 \xe2\x80\x93 BlaBla'", file=f)
    print(r"b'Boy\xe2\x80\x99s Dead.'", file=f)

with open('data.txt') as f:
    for line in f:
        print(line.rstrip()) #Output goes to terminal window

with fi.input(
        files='data.txt',
        inplace=True,
        backup='.bak') as f:

    for line in f:
        line = line.rstrip()[2:-1]  #Strip the leading b' and the trailing '
        new_line = re.sub(pattern, repl, line, flags=re.X)
        print(new_line) #Writes to file, not your terminal window

~/python_programs$ python3.4 prog.py 
b'Chapter 1 \xe2\x80\x93 BlaBla'
b'Boy\xe2\x80\x99s Dead.'

~/python_programs$ cat data.txt
Chapter 1  BlaBla
Boys Dead.
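
Note that the output above drops the escaped bytes entirely, so the dash and the apostrophe are lost. If the goal is to keep those characters, one possible alternative (a different technique, not part of the original answer) is to let ast.literal_eval parse each line as a bytes literal and then decode the result. The function name literal_line_to_text is made up for this sketch, and it assumes every line is a well-formed b'...' literal holding UTF-8 data.

```python
import ast


def literal_line_to_text(line):
    """Turn the text "b'...'" into the str that those bytes encode."""
    raw = ast.literal_eval(line.strip())  # parses the b'...' literal into bytes
    if not isinstance(raw, bytes):
        raise ValueError('expected a bytes literal, got %r' % (line,))
    return raw.decode('utf-8')


# The r prefix keeps the backslashes literal, as they would be in the file.
print(literal_line_to_text(r"b'Boy\xe2\x80\x99s Dead.'"))
```

The same function could be called on each line inside the fileinput loop in place of the re.sub() call.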

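Your file doesn't contain binary data, so you can read it (or write it) in text mode. It's just a matter of escaping things correctly.

Here is the first part: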

print(r"b'Chapter 1 \xe2\x80\x93 BlaBla'", file=f)

Python converts certain backslash escape sequences in a string into something else. One of the backslash escape sequences that Python converts has the format:

\xNN  #=> e.g. \xe2

The backslash escape sequence is four characters long, but Python converts it into a single character.
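
A quick way to see that conversion is to check the length of the resulting string. This is only a small illustration; the variable name s is arbitrary.

```python
s = "\xe2"              # four characters in the source code...
print(len(s))           # ...but only one character in the string
print(ord(s) == 0xe2)   # that character has code point 0xE2
```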

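However, I needed each of those four characters to be written to the sample file I created. To keep Python from converting the backslash escape sequence into one character, you can escape the leading '\' with another '\':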

\\xNN

But being lazy, I didn't want to go through your strings and escape each backslash escape sequence by hand, so I used:

r"...."

An r string escapes all the backslashes for you. As a result, Python writes all four characters of each \xNN sequence to the file.
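
The two spellings can be compared directly. In this small sketch the names escaped and raw are arbitrary; both hold the same four characters.

```python
escaped = "\\xe2"   # backslash escaped by hand
raw = r"\xe2"       # r string: the backslash is left alone

print(escaped == raw)   # the two literals produce the same string
print(len(raw))         # four characters: backslash, x, e, 2
print(list(raw))
```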

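The next problem is replacing a backslash in a string using a regex, which I suspect is where you ran into trouble. When a file contains a \, Python reads that into a string as \\ to represent a literal backslash. As a result, if the file contains the four characters: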

\xe2

Python reads that into a string as:

"\\xe2"

which, when printed, looks like:

\xe2

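The bottom line is: if you can see a '\' in a printed-out string, then the backslash is escaped in the string. To see what is really in a string, you should always use repr().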

string = "\\xe2"
print(string)
print(repr(string))

--output:--
\xe2
'\\xe2'

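Note that if the output has quotes around it, you are seeing everything in the string. If the output has no quotes, you can't be sure exactly what is in the string.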

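To build a regex pattern that matches a literal backslash in a string, the short answer is: you need to use double the number of backslashes that you would expect. With the string:

"\\xe2"

you would think the pattern would be:

pattern = "\\x"

but based on the doubling rule, you actually need:

pattern = "\\\\x"

And remember r strings? If you use an r string for the pattern, you can write what looks reasonable, and the r string will escape all the backslashes, doubling them:

pattern = r"\\x"  #=> equivalent to "\\\\x"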
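
Putting the doubling rule together with an r string, a minimal check might look like the sketch below. The sample line is taken from the question; the variable names are arbitrary.

```python
import re

line = r"Boy\xe2\x80\x99s Dead."    # contains literal backslashes, as read from the file
pattern = r"\\x[a-f0-9]{2}"         # literal backslash, an x, then two hex characters

# The r string and the hand-doubled version are the same pattern.
print(pattern == "\\\\x[a-f0-9]{2}")

cleaned = re.sub(pattern, "", line)
print(cleaned)                      # Boys Dead.
```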
