Replace unicode characters only once in Python

I'm trying to create a small script that replaces a set of characters in a file, like this:

# coding=utf-8

import codecs
import os
import sys

args = sys.argv

if len(args) > 1:
    subtitleFileName = args[1]
    newSubtitleFileName = subtitleFileName + "_new"

    replacePairs = {
        u"ã": "ă",
        u"Ã": "Ă",
        u"º": "ș",
        u"ª": "Ș",
        u"þ": "ț",
        u"Þ": "Ț",
    }

    if os.path.isfile(subtitleFileName):
        oldSubtitleFile = codecs.open(subtitleFileName, "rb", "ISO-8859-1")

        subtitleContent = oldSubtitleFile.read()
        subtitleContent = codecs.encode(subtitleContent, "utf-8")

        for key, value in replacePairs.iteritems():
            subtitleContent = subtitleContent.replace(codecs.encode(key, "utf-8"), value)

        oldSubtitleFile.close()

        newSubtitleFile = open(newSubtitleFileName, "wb")
        newSubtitleFile.write(subtitleContent)
        newSubtitleFile.close()

        os.remove(subtitleFileName)
        os.rename(newSubtitleFileName, subtitleFileName)

        print "Done!"
    else:
        print "Missing subtitle file!"
else:
    print "Missing arguments!"

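On the first run, it works fine.

So if I have a file containing Eºti sigur cã vrei sã ºtergi fiºierele?, after running the script against that file I get Ești sigur că vrei să ștergi fișierele?, which is exactly what I want. But if I run it multiple times, I get: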

EÈti sigur cÄ vrei sÄ Ètergi fiÈierele?

EĂÂti sigur cĂÂ vrei sĂÂ ĂÂtergi fiĂÂierele?

EÄÂĂÂti sigur cÄÂĂÂ vrei sÄÂĂÂ ÄÂĂÂtergi fiÄÂĂÂierele?

EĂÂĂÂÄÂĂÂti sigur cĂÂĂÂÄÂĂÂ vrei sĂÂĂÂÄÂĂÂ ĂÂĂÂÄÂĂÂtergi fiĂÂĂÂÄÂĂÂierele?

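I don't understand why. How does it find characters (ã, º, etc.) that no longer exist in the file in order to replace them? And why does it replace them with other characters?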

Don't work on encoded content. Encode only when writing the new file:

import codecs
import os
import sys

args = sys.argv

if len(args) > 1:
    subtitleFileName = args[1]
    newSubtitleFileName = subtitleFileName + "_new"

    replacePairs = {
        u"ã": u"ă",
        u"Ã": u"Ă",
        u"º": u"ș",
        u"ª": u"Ș",
        u"þ": u"ț",
        u"Þ": u"Ț",
    }

    if os.path.isfile(subtitleFileName):
        with codecs.open(subtitleFileName, "rb", "ISO-8859-1") as oldSubtitleFile:
            subtitleContent = oldSubtitleFile.read()

        for key, value in replacePairs.iteritems():
            subtitleContent = subtitleContent.replace(key, value)

        with codecs.open(newSubtitleFileName, "wb", "utf-8") as newSubtitleFile:
            newSubtitleFile.write(subtitleContent)

        os.remove(subtitleFileName)
        os.rename(newSubtitleFileName, subtitleFileName)

        print "Done!"
    else:
        print "Missing subtitle file!"
else:
    print "Missing arguments!"

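It's simple: on the first run you read ISO-8859-1 and write UTF-8. On the second run you do exactly the same thing, even though the input is now UTF-8 rather than ISO-8859-1. On subsequent runs, the search and replace no longer works as intended.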

This test mimics your second iteration, interpreting UTF-8 as ISO-8859-1:

# -*- coding: utf-8 -*-
print "Ești sigur că vrei să ștergi fișierele?".decode("ISO-8859-1")
>> EÈti sigur cÄ vrei sÄ Ètergi fiÈierele?

The next iteration looks like:

print "Ești sigur că vrei să ștergi fișierele?".decode("ISO-8859-1").encode("utf-8").decode("ISO-8859-1")
>> EÃÂti sigur cÃÂ vrei sÃÂ ÃÂtergi fiÃÂierele?
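
The growth across runs can also be reproduced in memory. The following is a minimal Python 3 sketch, not the original script: `simulate_run` and `REPLACEMENTS` are names made up for this illustration, and the sketch assumes that each run decodes the bytes as ISO-8859-1, applies the replacements, and writes UTF-8, as described above.

```python
# Python 3 sketch: simulate several runs of the original script in memory.
# simulate_run and REPLACEMENTS are hypothetical names used only here.
REPLACEMENTS = {"ã": "ă", "Ã": "Ă", "º": "ș", "ª": "Ș", "þ": "ț", "Þ": "Ț"}


def simulate_run(data):
    """One run: always decode as ISO-8859-1, replace, always encode as UTF-8."""
    text = data.decode("iso-8859-1")
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text.encode("utf-8")


results = []
data = "cã".encode("iso-8859-1")  # the file as it was before the first run
for _ in range(3):
    data = simulate_run(data)
    results.append(data)

for number, output in enumerate(results, start=1):
    # byte lengths grow with every run: 3, 5, 9
    print(number, len(output), repr(output.decode("utf-8")))
```

On the third run the misdecoded text happens to contain "Ã", which is one of the keys, so it is replaced with "Ă". That is where the "Ă" in the later outputs of the question comes from, even though the file never contained a real "Ã".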

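Follow @Daniel's advice: decode once, replace Unicode with Unicode, then encode once. I've also been told it's better to use io.open() rather than codecs, as it is compatible with Python 3 and solves the universal newlines problem.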
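
A minimal Python 3 sketch of the io.open() variant follows. The function name `fix_subtitles`, the file names, and the choice to write to a separate output path are assumptions made for this illustration; they are not part of the original script.

```python
# Python 3 sketch: decode once on read, replace on text, encode once on write.
# fix_subtitles and the temporary file names are hypothetical.
import io
import os
import tempfile

REPLACEMENTS = {"ã": "ă", "Ã": "Ă", "º": "ș", "ª": "Ș", "þ": "ț", "Þ": "Ț"}


def fix_subtitles(src_path, dst_path, src_encoding="iso-8859-1"):
    # io.open() decodes while reading, so `text` is Unicode from here on.
    with io.open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    # Encoding happens exactly once, when the text is written out.
    with io.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)


# Small demonstration on a temporary ISO-8859-1 file.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "in.srt")
dst_path = os.path.join(tmp_dir, "out.srt")
with io.open(src_path, "wb") as f:
    f.write("Eºti sigur cã vrei sã ºtergi fiºierele?".encode("iso-8859-1"))

fix_subtitles(src_path, dst_path)

with io.open(dst_path, "r", encoding="utf-8") as f:
    result = f.read()
print(result)
```

Note that this still must be run only once per file, because the output encoding differs from the input encoding.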
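
It is incorrect to use the "ISO-8859-1" character encoding on "utf-8" content: the first time you run your script, it takes a text file (presumably "ISO-8859-1"-encoded) and saves it as "utf-8" while replacing some Unicode characters.

Then you run the conversion a second time, and it takes the "utf-8" content and tries to interpret it as "ISO-8859-1", which is wrong.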

To avoid confusion, do the replacement separately from the change of character encoding. That way the replacement is idempotent.
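
As a rough Python 3 illustration of that separation (the function names `replace_chars` and `transcode` are hypothetical), the replacement can work purely on Unicode text while the encoding change is a separate bytes-to-bytes step:

```python
# Python 3 sketch: keep character replacement and transcoding apart.
# replace_chars and transcode are hypothetical names for this illustration.
REPLACEMENTS = {"ã": "ă", "Ã": "Ă", "º": "ș", "ª": "Ș", "þ": "ț", "Þ": "Ț"}


def replace_chars(text):
    """Unicode in, Unicode out; no encoding is involved."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text


def transcode(data, source="iso-8859-1", target="utf-8"):
    """Bytes in, bytes out; changes only the encoding, so apply it once."""
    return data.decode(source).encode(target)


once = replace_chars("Eºti sigur cã vrei sã ºtergi fiºierele?")
twice = replace_chars(once)
print(once == twice)  # applying the replacement again changes nothing
```

The replacement is idempotent here because none of the replacement values is itself one of the keys.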

To do the replacement, you could use the fileinput module and unicode.translate():

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Replace some characters in 'iso-8859-1'-encoded files."""
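from __future__ import print_function  # allows print(..., end="") on Python 2

import fileinput  # read files given on the command-line and/or stdin

replacements = {
    u"ã": u"ă",
    u"Ã": u"Ă",
    u"º": u"ș",
    u"ª": u"Ș",
    u"þ": u"ț",
    u"Þ": u"Ț",
}
# translate() expects code points as keys: key => ord(key)
replacements = dict(zip(map(ord, replacements.keys()), replacements.values()))
for line in fileinput.input(openhook=fileinput.hook_encoded("iso-8859-1")):
    # each line already ends with a newline, so don't let print() add another
    print(line.translate(replacements), end="")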
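
On Python 3, where `unicode` is simply `str`, the translation table can be built with `str.maketrans()` instead of calling `ord()` by hand. A small sketch, with the sample word chosen only for illustration:

```python
# Python 3 sketch: str.maketrans() converts single-character keys to code points.
table = str.maketrans({
    "ã": "ă",
    "Ã": "Ă",
    "º": "ș",
    "ª": "Ș",
    "þ": "ț",
    "Þ": "Ț",
})

fixed = "Þarã".translate(table)
print(fixed)  # Țară
```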

To control the encoding of the output file, you could set PYTHONIOENCODING, e.g., in bash:

$ PYTHONIOENCODING=utf-8 python replace-chars.py iso-8859-1.txt >replaced.utf-8

This command both replaces the characters and transcodes the input from "iso-8859-1" to "utf-8".

If the input filename.txt is already corrupted (no single character encoding decodes it correctly), then you could try the ftfy module to fix common encoding errors:

$ ftfy filename.txt >filename.utf8.txt