Is there a faster way to clean out control characters in a file?

Question

Previously, I had been cleaning out data using the code snippet below:

import unicodedata, re, io

all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C')
cc_re = re.compile('[%s]' % re.escape(control_chars))
def rm_control_chars(s): # see http://www.unicode.org/reports/tr44/#General_Category_Values
    return cc_re.sub('', s)

cleanfile = []
with io.open('filename.txt', 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars(line)
        cleanfile.append(line)

There are newlines in the file that I want to keep.

Below is the time taken by cc_re.sub('', s) on the first few lines (the 1st column is the time taken and the 2nd column is len(s)):

0.275146961212 251
0.672796010971 614
0.178567171097 163
0.200030088425 180
0.236430883408 215
0.343492984772 313
0.317672967911 290
0.160616159439 142
0.0732028484344 65
0.533437013626 468
0.260229110718 236
0.231380939484 204
0.197766065598 181
0.283867120743 258
0.229172945023 208

As suggested by @ashwinichaudhary, using s.translate(dict.fromkeys(control_chars)) with the same timing log gives:

0.464188098907 252
0.366552114487 615
0.407374858856 164
0.322507858276 181
0.35142993927 216
0.319973945618 314
0.324357032776 291
0.371646165848 143
0.354818105698 66
0.351796150208 469
0.388131856918 237
0.374715805054 205
0.363368988037 182
0.425950050354 259
0.382766962051 209
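
One detail worth checking with the translate approach: str.translate looks characters up by integer code point, so a table keyed by one-character strings would leave the text unchanged. The following is a minimal sketch of a code-point-keyed table, assuming Python 3 (chr and range instead of unichr and xrange) and assuming that newline, carriage return and tab should be kept. The names KEEP, DELETE_TABLE and rm_control_chars_translate are illustrative, not from the original post.

```python
import sys
import unicodedata

# Assumption: these whitespace characters should survive the cleanup.
KEEP = {"\n", "\r", "\t"}

# Map every code point whose general category starts with "C" to None,
# which tells str.translate to delete it. Keys are integers, not strings.
DELETE_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp))[0] == "C" and chr(cp) not in KEEP
}


def rm_control_chars_translate(s):
    """Remove category-C characters from s, keeping newline, CR and tab."""
    return s.translate(DELETE_TABLE)
```

The table is built once at import time and reused for every line, so its construction cost is not part of the per-line timing.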

But the code is really slow for my 1GB of text. Is there any other way to clean out control characters?

Answer 1

I found a solution that works character by character. I benchmarked it using a 100K file:

import unicodedata, re, io
from time import time

# This is to generate randomly a file to test the script

from string import lowercase
from random import random

all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = [c for c in all_chars if unicodedata.category(c)[0] == 'C']
chars = (list(u'%s' % lowercase) * 115117) + control_chars

fnam = 'filename.txt'

out = io.open(fnam, 'w', encoding='utf8')  # same encoding used when reading the file back

for line in range(1000000):
    out.write(u''.join(chars[int(random()*len(chars))] for _ in range(600)) + u'\n')
out.close()


# version proposed by alvas
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C')
cc_re = re.compile('[%s]' % re.escape(control_chars))
def rm_control_chars(s):
    return cc_re.sub('', s)

t0 = time()
cleanfile = []
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars(line)
        cleanfile.append(line)
out=io.open(fnam + '_out1.txt', 'w')
out.write(''.join(cleanfile))
out.close()
print time() - t0

# using a set and checking character by character
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = set(c for c in all_chars if unicodedata.category(c)[0] == 'C')
def rm_control_chars_1(s):
    return ''.join(c for c in s if c not in control_chars)

t0 = time()
cleanfile = []
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars_1(line)
        cleanfile.append(line)
out=io.open(fnam + '_out2.txt', 'w')
out.write(''.join(cleanfile))
out.close()
print time() - t0

The output is:

114.625444174
0.0149750709534

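I tried it on a 1 GB file (only the second version) and it took 186 s.

I also wrote another version of the same script that is slightly faster (176 s) and more memory-efficient (for very large files that do not fit in RAM):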

t0 = time()
out=io.open(fnam + '_out5.txt', 'w')
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        out.write(rm_control_chars_1(line))
out.close()
print time() - t0
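
For readers on Python 3, the streaming variant could be sketched roughly as follows. This is an adaptation rather than the benchmarked code: it assumes Python 3, assumes newline, carriage return and tab should be kept, and uses the hypothetical names KEEP, CONTROL_CHARS, clean_stream and clean_file. No timings are claimed for it.

```python
import sys
import unicodedata

# Assumption: these whitespace characters should survive the cleanup.
KEEP = {"\n", "\r", "\t"}

# Set of all category-C characters, minus the ones we keep.
CONTROL_CHARS = frozenset(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp))[0] == "C" and chr(cp) not in KEEP
)


def clean_stream(fin, fout):
    """Copy text from fin to fout line by line, dropping control characters."""
    for line in fin:
        fout.write("".join(c for c in line if c not in CONTROL_CHARS))


def clean_file(src, dst):
    """Clean the UTF-8 file src into dst without holding it all in memory."""
    # newline="" leaves line endings untouched on both read and write.
    with open(src, "r", encoding="utf8", newline="") as fin, \
         open(dst, "w", encoding="utf8", newline="") as fout:
        clean_stream(fin, fout)
```

Because each cleaned line is written immediately, memory use stays bounded by the longest line rather than by the file size.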

Answer 2

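I would try a few things here.

First, do the substitution with a replace-all regex.

Second, set up the regex character class with known control character ranges
instead of a class of individual control characters.
(This is in case the engine doesn't optimize it into ranges.
A range requires two conditionals at the assembly level,
as opposed to an individual conditional for each character in the class.)

Third, since you are removing the characters, add a greedy quantifier
after the class. This removes the need to enter the substitution
subroutine after every single character match; instead it grabs all
adjacent characters as needed.

I don't know Python's syntax for regex constructs off the top of my head,
nor all the control codes in Unicode, but the result would look
something like this: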

[\u0000-\u0009\u000B\u000C\u000E-\u001F\u007F]+

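The largest amount of time would be spent copying the results to another string.
The smallest amount of time would be spent finding all the control codes,
which would be minuscule.

All things being equal, the regex (as described above) is the fastest way to go.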
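
In Python the pattern above could be used roughly like this. It is a minimal sketch assuming Python 3.3 or later, where the re module understands \uXXXX escapes in string patterns. The names CC_RUNS and rm_control_chars_ranges are illustrative. Note that the class as written covers only the ASCII control range, keeps \n and \r, and does remove tab (U+0009); other Unicode categories would need additional ranges.

```python
import re

# Ranges instead of individual characters, plus "+" so that each run of
# adjacent control characters is removed by a single substitution.
CC_RUNS = re.compile(r"[\u0000-\u0009\u000B\u000C\u000E-\u001F\u007F]+")


def rm_control_chars_ranges(s):
    """Remove runs of ASCII control characters, keeping \\n and \\r."""
    return CC_RUNS.sub("", s)
```

Whether this beats the per-character set lookup on a given file is something to measure rather than assume.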

Answer 3

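Since in UTF-8 the ASCII control characters are each encoded as a single byte (compatible with ASCII) with a value below 32, I suggest this fast piece of code: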

#!/usr/bin/python
import sys

ctrl_chars = [x for x in range(0, 32) if x not in (ord("\r"), ord("\n"), ord("\t"))]
filename = sys.argv[1]

with open(filename, 'rb') as f1:
  with open(filename + '.txt', 'wb') as f2:
    b = f1.read(1)
    while b:  # read() returns an empty string at end of file
      if ord(b) not in ctrl_chars:
        f2.write(b)
      b = f1.read(1)

Does that work for you?
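
The same byte-level idea can be applied to large chunks instead of one byte per read call. The following is a sketch under assumptions, not a measured improvement: it assumes Python 3, uses bytes.translate with a delete argument, and the names DELETE_BYTES and clean_bytes_stream are hypothetical. Splitting at arbitrary chunk boundaries is safe here because bytes below 0x80 never occur inside a multi-byte UTF-8 sequence.

```python
# Bytes to delete: values below 32 except tab, newline and carriage return,
# the same set of values as ctrl_chars in the answer above.
DELETE_BYTES = bytes(b for b in range(32) if b not in (0x09, 0x0A, 0x0D))


def clean_bytes_stream(fin, fout, chunk_size=1 << 20):
    """Copy binary data from fin to fout in chunks, dropping DELETE_BYTES."""
    while True:
        chunk = fin.read(chunk_size)
        if not chunk:  # empty bytes object means end of file
            break
        # table=None keeps every byte as is; the second argument lists
        # the bytes to remove.
        fout.write(chunk.translate(None, DELETE_BYTES))
```

Like the original, this only handles control characters below 32; DEL (0x7F) and control characters outside ASCII are left in place.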

Answer 4

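Does this have to be in Python? How about cleaning the file before you read it in Python to begin with? Use sed, which will process it line by line anyway.

See removing control characters using sed.

If you pipe the output to another file, you can then open that. I don't know how fast it would be, though. You can do it in a shell script and test it. According to this page, sed handles 82M characters per second.

Hope it helps.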

Answer 5

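If you want it to move really fast, break your input into multiple chunks, wrap up that data-munging code as a method, and use Python's multiprocessing package to parallelize it, writing to some common text file. Going character by character is the easiest way to crunch stuff like this, but it always takes a while.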

https://docs.python.org/3/library/multiprocessing.html
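
A rough sketch of that idea, assuming Python 3 and an ASCII-range regex as the cleaning step. All names here (CC_RUNS, clean_chunk, chunked, clean_file_parallel) are hypothetical. In this variant the parent process does the writing, which is one simple way to keep the output in order; having workers write to a shared file would need extra coordination.

```python
import re
from itertools import islice
from multiprocessing import Pool

# Assumption: only ASCII control characters are removed; \n and \r are kept.
CC_RUNS = re.compile(r"[\u0000-\u0009\u000B\u000C\u000E-\u001F\u007F]+")


def clean_chunk(lines):
    """Worker: remove control characters from a list of lines."""
    return [CC_RUNS.sub("", line) for line in lines]


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def clean_file_parallel(src, dst, workers=4, lines_per_chunk=10000):
    """Clean src into dst, spreading chunks of lines over worker processes."""
    with open(src, encoding="utf8", newline="") as fin, \
         open(dst, "w", encoding="utf8", newline="") as fout, \
         Pool(workers) as pool:
        # imap yields results in input order, so lines stay in sequence.
        for cleaned in pool.imap(clean_chunk, chunked(fin, lines_per_chunk)):
            fout.writelines(cleaned)
```

On platforms that start workers by spawning a fresh interpreter, the call to clean_file_parallel should sit under an `if __name__ == "__main__":` guard. The cost of shipping lines between processes can outweigh the gain, so this is worth timing against the single-process versions.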

Answer 6

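I'm surprised no one has mentioned mmap, which might be just the right fit here.

Note: I'm putting this in as an answer in case it's useful, and I apologize that I don't have the time to actually test and compare it right now.

You load the file into memory (sort of), and then you can actually run a re.sub() over the object. This helps eliminate the IO bottleneck and allows you to change the bytes in place before writing them back at once.

After this, you can experiment with str.translate() vs. re.sub(), and also include any further optimizations such as double-buffering CPU and IO or using multiple CPU cores/threads.

But it'll look something like this: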

import mmap

f = open('test.out', 'r')
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

A nice excerpt from the mmap documentation is:

..You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they’re mutable, you can change a single character by doing obj[index] = 'a',..
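
To make the idea concrete, here is an untested-for-speed sketch assuming Python 3, where a memory-mapped file behaves like a bytes-like object and therefore needs a bytes pattern. The names CC_RUNS_BYTES and clean_with_mmap are hypothetical, and the pattern covers only ASCII control bytes other than \n and \r. Note that the substitution returns a new bytes object holding the cleaned content rather than editing the mapping in place.

```python
import mmap
import re

# Bytes pattern: ASCII control bytes except \n (0x0A) and \r (0x0D).
CC_RUNS_BYTES = re.compile(rb"[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]+")


def clean_with_mmap(src, dst):
    """Map src into memory, strip control bytes, write the result to dst."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # Length 0 maps the whole file; an empty file cannot be mapped.
        with mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ) as m:
            fout.write(CC_RUNS_BYTES.sub(b"", m))
```

Since the cleaned result is built in memory before being written, this variant suits files that fit in RAM; for larger inputs a chunked approach is likely the safer choice.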
