Regular expression - Python [list query]

Question

I am trying to write a regular expression for this list:

data= ["Fred is Deputy Manager. He is working for MNC.", "Rita is another employee in AC Corp."]

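I want to remove every word that starts with a capital letter, except that the first word of each sentence should be left alone, i.e. Fred, He and Rita must not be removed.

The output should be:

Output: ["Fred is. He is working for.", "Rita is another employee in."]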

I tried looking for a solution but could not find any relevant code. Any help would be appreciated.

Thanks.

Answer 1

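You will need to find and remove all capitalized words that do not follow sentence-ending punctuation, then find and remove the spaces left behind before the punctuation (this solution isn't the cleanest, but it works). A list comprehension also comes in handy here.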

import re

data = ["Fred is Deputy Manager. He is working for MNC.", "Rita is another employee in AC Corp."]
# remove runs of capitalized words, except at the start of the string or right after ". "
text = [re.sub(r'(?<!\.\s)(?!^)\b([A-Z]\w*(?:\s+[A-Z]\w*)*)', '', item) for item in data]
# drop the space left before punctuation, keeping the punctuation itself (group 1)
output = [re.sub(r'\s([?.!"](?:\s|$))', r'\1', item) for item in text]

>>> output
['Fred is. He is working for.', 'Rita is another employee in.']
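
For readers who find the first pattern hard to parse, here is a minimal sketch of the same two steps with the pattern spelled out in re.VERBOSE mode. The names CAPITALIZED_RUN, SPACE_BEFORE_PUNCT and strip_capitalized are illustrative only and are not part of the answer above.

```python
import re

# First step, written in verbose mode so each piece can carry a comment.
CAPITALIZED_RUN = re.compile(r"""
    (?<!\.\s)          # not right after a period plus one whitespace character
    (?!^)              # not at the very start of the string
    \b                 # begin at a word boundary
    [A-Z]\w*           # a word that starts with an uppercase letter
    (?:\s+[A-Z]\w*)*   # any further capitalized words that directly follow
""", re.VERBOSE)

# Second step: a whitespace character sitting right before punctuation.
SPACE_BEFORE_PUNCT = re.compile(r'\s([?.!"](?:\s|$))')


def strip_capitalized(sentence):
    """Remove capitalized words that do not start a sentence (illustrative helper)."""
    without_caps = CAPITALIZED_RUN.sub('', sentence)
    # Put back only the punctuation (group 1), dropping the stray space before it.
    return SPACE_BEFORE_PUNCT.sub(r'\1', without_caps)


data = ["Fred is Deputy Manager. He is working for MNC.", "Rita is another employee in AC Corp."]
output = [strip_capitalized(item) for item in data]
print(output)
```

Note that the lookbehind only recognizes a period followed by exactly one whitespace character as a sentence boundary, so text with other spacing or punctuation would need a different pattern.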

Answer 2

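First, I apologize for how unhelpful the Python 3 regular expression documentation is. All the information needed to answer this question can technically be found in those docs, but you already need to know a bit about how re works to make sense of it. That said, hopefully this helps: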

A simple answer

You can try the following code:

import re

data = ["Fred is Deputy Manager. He is working for MNC.", "Rita is another employee in AC Corp."]

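matcher = re.compile(r"(?<![.])[ ][A-Z][A-Za-z]*")
print([matcher.sub("", d) for d in data])
# prints: ['Fred is. He is working for.', 'Rita is another employee in.']

Basically, this compiles a regular expression that matches capitalized words that do not come right after a period:

(?<![.]) -> negative lookbehind: the match fails if it is immediately preceded by a period
[ ][A-Z][A-Za-z]* -> any capitalized word (with a leading space, which ensures the first word in the string never matches)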

It then applies the regular expression to every string in the list and replaces each match with the empty string: ""
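
To see which pieces the pattern actually picks out, it can help to list the matches before substituting. This is a minimal sketch, assuming the letter class is written as [A-Za-z]; the variable names sentence and matches are only illustrative.

```python
import re

# A space plus a capitalized word, as long as the space does not follow a period.
matcher = re.compile(r"(?<![.])[ ][A-Z][A-Za-z]*")

sentence = "Fred is Deputy Manager. He is working for MNC."

# Each match includes its leading space, so removing it leaves no gap behind.
matches = matcher.findall(sentence)
print(matches)  # [' Deputy', ' Manager', ' MNC']
```

" He" is skipped because its space follows a period, and "Fred" is skipped because no space precedes it.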

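Some limitations

If your strings contain double spaces or other whitespace characters (such as tabs or carriage returns), this will break. You can handle that with the following instead:

matcher = re.compile(r"(?<![.])\s+[A-Z][A-Za-z]*")

where \s+ matches one or more whitespace characters.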
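
One caveat with \s+ is worth checking: when a period is followed by two whitespace characters, the lookbehind only blocks a match starting at the first of them, so the match can still start at the second and remove the sentence's first word. The sketch below compares the pattern from this answer with a hypothetical stricter variant (not part of the original answer) that also forbids whitespace before the match.

```python
import re

# Pattern from the answer: only a period directly before the whitespace is rejected.
naive = re.compile(r"(?<![.])\s+[A-Z][A-Za-z]*")
# Hypothetical variant: also reject whitespace before the match, so a match can
# only start at the first character of a whitespace run.
stricter = re.compile(r"(?<![.\s])\s+[A-Z][A-Za-z]*")

text = "Fred is  Deputy Manager.  He works for MNC."  # note the double spaces

naive_result = naive.sub("", text)
stricter_result = stricter.sub("", text)
print(repr(naive_result))     # 'Fred is.  works for.'
print(repr(stricter_result))  # 'Fred is.  He works for.'
```

Both patterns still treat only a period as a sentence boundary; sentences ending in ! or ? would need those characters added to the lookbehind class.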

Also, if your strings start with a space, this will break as well. You can handle that with:

print([matcher.sub("", d.strip()) for d in data])

which removes leading and trailing whitespace characters from each string.
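
A short illustration of why the leading space matters, as a sketch with a made-up input string (padded is not from the question): at the very start of the string nothing precedes the space, so the lookbehind passes and the first word is removed too.

```python
import re

matcher = re.compile(r"(?<![.])\s+[A-Z][A-Za-z]*")

padded = " Fred is Deputy Manager."  # hypothetical input with a leading space

without_strip = matcher.sub("", padded)
with_strip = matcher.sub("", padded.strip())
print(repr(without_strip))  # ' is.'
print(repr(with_strip))     # 'Fred is.'
```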
