Skipping an unknown number of words in Python
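I usually just extract phrases and print them out in a pre-specified format after running my script on a document.
I use this code to split the text into phrases: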
import re

def iterphrases(text):
    # Drop the final period, then split on a period followed by whitespace.
    return re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
Then I read the files, and if the word is in the file, I append the sentence to a dictionary.
import os
from collections import defaultdict

def find_keywords(OutputFile, keys):
    # Search for each keyword as given, upper-cased, lower-cased and capitalized.
    phrase_combos = keys + [x.upper() for x in keys] + [x.lower() for x in keys] + [x.capitalize() for x in keys]
    keys = list(set(phrase_combos))
    cwd = os.getcwd()
    print('Working in current directory : ', cwd)
    cwdfiles = os.listdir(cwd)
    filenames = []
    for item in cwdfiles:
        if item[-4:] == '.txt':
            filenames.append(item)
    out = defaultdict(list)
    for filename in filenames:
        for phrase in iterphrases(open(filename).read()):
            for keyword in keys:
                # Keep the phrase when 'no' occurs before the keyword.
                # Note: str.index raises ValueError when the substring is absent.
                if phrase.lower().index('no') < phrase.index(keyword):
                    out[keyword].append((filename, phrase))
    my_dict = dict(**out)
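I do a few things with the result, and this has worked for a while, but now I need to find phrases saying that something is not present. I can find a lot of them, but some have extra words in between and won't match exactly. For example, if my keyword is the word foo:
No foo. Not foo. Not foo or bar. No foo and no bar. These are all in my dictionary, but I also need: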
Not bar or foo. Not bar or foo or banana. Not bar or banana or foo. Not bar, banana, or foo. Not bar, foo, or banana.
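to all show up as results as well. Right now they fail to match because foo is not right next to the negative word. Is there a way for me to say 'Match if negative words appear regardless of how many other words are between the word/phrase of interest as long as you are in the same sentence'?
For example, given something like this: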
This is a group of Text. There is no foo. There is no bar. There is no foo
or bar. There is no bar or foo. I have coffee. I have a bar. No bar for you.
it should return:
{'bar': ['There is no bar.', 'There is no bar or foo.', 'There is no foo or bar.', 'No bar for you.']}
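Try searching with regular expressions. You can search for a list of keywords and use a list of negation words to negate them.
The trick is to compile a regular expression that searches your sentence for 'a negation word somewhere before my keyword'. That means: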
re.compile(r'\b{!s}\b.+\b{!s}\b'.format(neg, keyword), re.I)
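where \b denotes a 'word boundary'. So the pattern is a word, followed by one or more arbitrary characters (.+), followed by another word. With format we fill in the negation word and the keyword. re.I sets the ignore-case flag.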
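To see what the word boundaries buy you, here is a small sketch of that pattern on a few sentences. The helper name negated_before is made up for this illustration, and the re.escape calls are an addition of mine (an assumption that your words might one day contain regex metacharacters); neither is part of the pattern above.

```python
import re

def negated_before(neg, keyword):
    # Hypothetical helper: builds the pattern described above.
    # re.escape is an added safeguard for words containing regex metacharacters.
    return re.compile(
        r'\b{!s}\b.+\b{!s}\b'.format(re.escape(neg), re.escape(keyword)),
        re.I,
    )

pattern = negated_before('no', 'foo')

# 'no' stands alone and comes before 'foo', with other words in between.
print(bool(pattern.search('There is no bar or foo')))          # True
# 'Nonono' contains 'no', but never as a whole word.
print(bool(pattern.search('Nonono, this is the wrong foo')))   # False
print(bool(pattern.search('Anonymous foo')))                   # False
# The negation comes after the keyword, so this does not count.
print(bool(pattern.search('foo, but no bar')))                 # False
```

Because .+ needs at least one character, the single space in 'no foo' is enough for a match.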
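Using all of your examples, plus a few that I assume you do not want to match, such as 'Nonono this is not the right foo' or 'Anonymous foo...', I came up with the following, which should give you a starting point: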
import re

text = 'Not foo. Not No foo. Not foo or bar. No foo and no bar. Not bar or foo. Not bar or foo or banana. Not bar or banana or foo. Not bar, banana, or foo. Not bar, foo, or banana. This is a group of Text. There is no foo. There is no bar. There is no foo or bar. There is no bar or foo. I have coffee. I have a bar. No bar for you. Nonono, this is the wrong foo. Nono this is also a wrong foo. Anonymous foo.'
keywords = ['foo']
negated = ['no', 'not']
phraselist = re.split(r'\.\s', text)
out = {}
for phrase in phraselist:
    for keyword in keywords:
        for neg in negated:
            # A negation word, anything in between, then the keyword.
            regex = re.compile(r'\b{!s}\b.+\b{!s}\b'.format(neg, keyword), re.I)
            if regex.search(phrase.lower()):
                try:
                    # Avoid adding a phrase twice when several negations match.
                    if phrase not in out[keyword]:
                        out[keyword].append(phrase)
                except KeyError:
                    out[keyword] = [phrase]
print(out)
expected = 'Not foo. Not No foo. Not foo or bar. No foo and no bar. Not bar or foo. Not bar or foo or banana. Not bar or banana or foo. Not bar, banana, or foo. Not bar, foo, or banana. There is no foo. There is no foo or bar. There is no bar or foo.'
print(expected)
The output is:
{'foo': ['Not foo', 'Not No foo', 'Not foo or bar', 'No foo and no bar', 'Not bar or foo', 'Not bar or foo or banana', 'Not bar or banana or foo', 'Not bar, banana, or foo', 'Not bar, foo, or banana', 'There is no foo', 'There is no foo or bar', 'There is no bar or foo']}
Not foo. Not No foo. Not foo or bar. No foo and no bar. Not bar or foo. Not bar or foo or banana. Not bar or banana or foo. Not bar, banana, or foo. Not bar, foo, or banana. There is no foo. There is no foo or bar. There is no bar or foo.
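If you want this in reusable form, one possible way to wrap it up is sketched below. The function name find_negated and its negations parameter are my own inventions for illustration, and folding all negation words into a single alternation (plus re.escape) is a variation on the loop above, not something your code requires. It reuses your own period-stripping split.

```python
import re
from collections import defaultdict

def find_negated(text, keywords, negations=('no', 'not')):
    # Hypothetical helper: one alternation covers every negation word,
    # so each keyword needs only a single compiled pattern.
    neg_group = '|'.join(re.escape(n) for n in negations)
    patterns = {
        kw: re.compile(r'\b(?:{})\b.+\b{}\b'.format(neg_group, re.escape(kw)), re.I)
        for kw in keywords
    }
    # Drop the final period, then split into phrases as in the question.
    phrases = re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
    out = defaultdict(list)
    for phrase in phrases:
        for kw, pattern in patterns.items():
            # Skip phrases that were already recorded for this keyword.
            if pattern.search(phrase) and phrase not in out[kw]:
                out[kw].append(phrase)
    return dict(out)

sample = ('This is a group of Text. There is no foo. There is no bar. '
          'There is no foo or bar. There is no bar or foo. I have coffee. '
          'I have a bar. No bar for you.')
result = find_negated(sample, ['bar'])
print(result)
# {'bar': ['There is no bar', 'There is no foo or bar', 'There is no bar or foo', 'No bar for you']}
```

Note that . does not match a newline by default, so if a sentence can wrap across lines in your files you would probably want to add the re.S flag or join the lines first.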