How to make exceptions for certain words in regex

Question

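I'm new to programming and to regex, so I apologize if this has been asked before (I couldn't find it, though).

I want to use Python to summarize word frequencies in a literal text. Let's assume the text is formatted like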

Chapter 1
blah blah blah

Chapter 2
blah blah blah
....

Now I read the text in as a single string, and I want to use re.findall to get every word in it, so my code is

wordlist = re.findall(r'\b\w+\b', text)

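The problem is that it also matches the word Chapter (and its number) in every chapter title, which I do not want to include in my stats. So I want to ignore whatever matches Chapter\s*\d+. How can I do that?

Thanks in advance, guys.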

Answer 1

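Match what you do not need and capture what you need. This technique works with re.findall because, when the pattern contains a capturing group, it returns only the captured values: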

re.findall(r'\bChapter\s*\d+\b|\b(\w+)\b',s)

Details:

\bChapter\s*\d+\b - Chapter as a whole word, followed by 0+ whitespace characters and 1+ digits
| - or
\b(\w+)\b - match and capture into Group 1 one or more word characters

Every Chapter match contributes an empty string for Group 1, so filter the result to avoid empty values in the final list:

import re
s = "Chapter 1: Black brown fox 45"
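# filter(None, ...) drops the empty strings produced by the Chapter matches;
# list() is needed on Python 3, where filter returns a lazy iterator.
print(list(filter(None, re.findall(r'\bChapter\s*\d+\b|\b(\w+)\b', s))))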
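Since the question is ultimately about word frequencies, the same pattern can feed a collections.Counter directly. The following is only a minimal sketch: the sample text and the name word_counts are made up for illustration and do not come from the original posts.

```python
import re
from collections import Counter

# Hypothetical sample text, made up for this sketch.
text = """Chapter 1
the quick brown fox

Chapter 2
the lazy dog
"""

# Chapter headings match the first alternative and yield '' for group 1;
# every other word is captured by group 1.
pattern = re.compile(r'\bChapter\s*\d+\b|\b(\w+)\b')

# Skip the empty strings produced by heading matches while counting.
word_counts = Counter(w for w in pattern.findall(text) if w)

print(word_counts.most_common(1))  # [('the', 2)]
```

Counting inside a generator expression avoids building the filtered list first, although findall itself still returns the full list of matches.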

Answer 2

Solution

You could first remove every Chapter + whitespace + digits sequence:

wordlist = re.findall(r'\b\w+\b', re.sub(r'Chapter\s*\d+\s*','',text))
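One caveat: the unanchored substitution above also removes a phrase such as "Chapter 2" when it appears inside ordinary prose. If only heading lines should be dropped, the pattern can be anchored to whole lines with re.MULTILINE. This is a sketch under that assumption; the sample text and the names heading and body are hypothetical.

```python
import re

# Hypothetical sample: the second line mentions a chapter inside prose.
text = """Chapter 1
As we will see in Chapter 2 things change

Chapter 2
blah blah
"""

# With re.MULTILINE, ^ and $ match at the start and end of every line,
# so only lines consisting solely of a chapter heading are removed.
heading = re.compile(r'^Chapter\s*\d+\s*$', re.MULTILINE)

body = heading.sub('', text)
wordlist = re.findall(r'\b\w+\b', body)
print(wordlist)
```

Here the mention of Chapter 2 in the middle of the sentence is kept and counted, while both heading lines are discarded. Whether that is the desired behavior depends on the text being analyzed.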

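If you want to use just one search, you can use a negative lookahead to find any word that does not start a "Chapter X" heading and does not begin with a digit. Note that this also skips any other token that starts with a digit, such as 45: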

wordlist = re.findall(r'\b(?!Chapter\s+\d+)[A-Za-z]\w*\b',text)
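To make the difference between the lookahead pattern and the capture-group pattern from the first answer concrete, here is a small comparison on the sample sentence used in that answer. The variable names lookahead and capture are only illustrative.

```python
import re

# Sample sentence reused from the first answer.
s = "Chapter 1: Black brown fox 45"

# Lookahead version: rejects "Chapter" when followed by whitespace and
# digits, and requires every word to start with an ASCII letter.
lookahead = re.findall(r'\b(?!Chapter\s+\d+)[A-Za-z]\w*\b', s)

# Capture-group version: heading matches yield '' and are filtered out.
capture = [w for w in re.findall(r'\bChapter\s*\d+\b|\b(\w+)\b', s) if w]

print(lookahead)  # ['Black', 'brown', 'fox']
print(capture)    # ['Black', 'brown', 'fox', '45']
```

The two approaches agree on ordinary words but differ on tokens that begin with a digit, so the choice depends on whether numbers should count as words.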

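If performance is a concern, loading a huge string and parsing it with a regex is not the right approach anyway. Just read the file line by line, discard any line that matches r'^Chapter\s*\d+', and parse each remaining line separately with r'\b\w+\b':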

import re

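# Note: readlines() still loads every line into memory at once.
with open("huge_file.txt", "r") as f:
    lines = f.readlines()

wordlist = []
chapter = re.compile(r'^Chapter\s*\d+')
words = re.compile(r'\b\w+\b')
for line in lines:
    # Skip chapter headings; collect the words of every other line.
    if not chapter.match(line):
        wordlist.extend(words.findall(line))

print(len(wordlist))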

Performance

I wrote a small Ruby script to generate a huge file:

all_dicts = Dir["/usr/share/dict/*"].map{|dict|
  File.readlines(dict)
}.flatten

File.open('huge_file.txt','w+') do |txt|
  newline=true
  txt.puts "Chapter #{rand(1000)}"
  50_000_000.times do
    if rand<0.05
      txt.puts
      txt.puts
      txt.puts "Chapter #{rand(1000)}"
      newline = true
    end
    txt.write " " unless newline
    newline = false
    txt.write all_dicts.sample.chomp
    if rand<0.10
      txt.puts
      newline = true
    end
  end
end

The resulting file contains more than 50 million words and is about 483MB in size:

Chapter 154
schoolyard trashcan's holly's continuations

Chapter 814
assure sect's Trippe's bisexuality inexperience
Dumbledore's cafeteria's rubdown hamlet Xi'an guillotine tract concave afflicts amenity hurriedly whistled
Carranza
loudest cloudburst's

Chapter 142
spender's
vests
Ladoga

Chapter 896
petition's Vijayawada Lila faucets
addendum Monticello swiftness's plunder's outrage Lenny tractor figure astrakhan etiology's
coffeehouse erroneously Max platinum's catbird succumbed nonetheless Nissan Yankees solicitor turmeric's regenerate foulness firefight
spyglass
disembarkation athletics drumsticks Dewey's clematises tightness tepid kaleidoscope Sadducee Cheerios's

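Extracting the word list in two steps took 12.2s on average, the lookahead method took 13.5s, and Wiktor's answer also took 13.5s. The lookahead method I first wrote used re.IGNORECASE and took about 18s.

When the whole file is read at once, there is basically no performance difference between the regex methods.

What surprised me is that the readlines script took about 20.5s and did not use much less memory than the other scripts. If you have any idea how to improve the script, please comment!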
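One possible direction, offered only as an untested-for-speed sketch: readlines() builds a list of every line, and wordlist then holds every word, so both grow with the file. Iterating over the file object reads lazily, and updating a collections.Counter keeps only one entry per distinct word. The function name count_words and the in-memory sample are assumptions made for this illustration, and no timing or memory figures were measured for it.

```python
import io
import re
from collections import Counter

CHAPTER = re.compile(r'^Chapter\s*\d+')
WORD = re.compile(r'\b\w+\b')

def count_words(lines):
    """Count words from an iterable of lines, skipping chapter headings."""
    counts = Counter()
    for line in lines:
        if not CHAPTER.match(line):
            counts.update(WORD.findall(line))
    return counts

# Small in-memory stand-in for the file. With a real file, the open file
# object itself can be passed, since iterating over it yields lines lazily.
sample = io.StringIO(
    "Chapter 154\nschoolyard holly schoolyard\n\nChapter 814\nassure\n"
)
counts = count_words(sample)
print(counts.most_common(1))  # [('schoolyard', 2)]
```

With the generated file this would be called as count_words(f) inside a with open("huge_file.txt") as f: block. Whether it is actually faster than the whole-string approaches would need to be measured.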
