Removing words in text files containing a character or string of letters with Python
I have a few lines of text and want to remove any word that contains a special character or a given fixed string (in Python).
Example:
in_lines = ['this is go:od',
            'that example is bad',
            'amp is a word']

# remove any word with {'amp', ':'}

out_lines = ['this is',
             'that is bad',
             'is a word']
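I know how to remove words that appear in a given list, but I can't remove words that contain a special character or a short run of given letters. Let me know if anything is unclear and I will add more information.
This is what I use to remove the selected words: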
def remove_stop_words(lines):
    stop_words = ['am', 'is', 'are']
    results = []
    for text in lines:
        tmp = text.split(' ')
        for stop_word in stop_words:
            for x in range(0, len(tmp)):
                if tmp[x] == stop_word:
                    tmp[x] = ''
        results.append(" ".join(tmp))
    return results

out_lines = remove_stop_words(in_lines)
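The exact-match version above leaves words such as go:od untouched because == only succeeds when the whole word equals the stop word, whereas the in operator tests for a substring. A minimal sketch of the difference (the variable names here are illustrative only, not taken from the question):

```python
word = 'go:od'

# Equality compares the whole word, so a word that merely contains ':' is not matched.
exact_match = (word == ':')

# The `in` operator checks whether ':' occurs anywhere inside the word.
substring_match = (':' in word)

# The same holds for letter sequences: 'amp' is inside 'example'.
amp_in_example = ('amp' in 'example')

print(exact_match, substring_match, amp_in_example)
```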
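This matches your expected output:
def remove_stop_words(lines):
    stop_words = ['amp', ':']
    results = []
    for text in lines:
        tmp = text.split(' ')
        for x in range(0, len(tmp)):
            for st_w in stop_words:
                # substring test: blank the word if it contains st_w anywhere
                if st_w in tmp[x]:
                    tmp[x] = ''
        # drop the blanked words so no stray spaces are left behind
        results.append(" ".join(w for w in tmp if w))
    return results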
in_lines = ['this is go:od',
            'that example is bad',
            'amp is a word']

def remove_words(in_list, bad_list):
    out_list = []
    for line in in_list:
        words = ' '.join([word for word in line.split() if not any([phrase in word for phrase in bad_list])])
        out_list.append(words)
    return out_list

out_lines = remove_words(in_lines, ['amp', ':'])
print(out_lines)
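The odd-looking expression
    word for word in line.split() if not any([phrase in word for phrase in bad_list])
does all the hard work at once. For a single word, it builds a list of True/False values, one for each phrase in the "bad" list. The any function then collapses this temporary list into a single True/False value; if that value is False, the word can safely be copied into the output list for that line.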
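To make the intermediate step visible, the sketch below prints the temporary True/False list for each word before any collapses it. The helper name flags_for and the sample sentence are made up for illustration and are not part of the answer's code.

```python
bad_list = ['amp', ':']

def flags_for(word, bad_list):
    # One True/False entry per phrase: is the phrase a substring of the word?
    return [phrase in word for phrase in bad_list]

for word in 'that example is go:od'.split():
    flags = flags_for(word, bad_list)
    # any(flags) is True as soon as one phrase matched, so the word is dropped.
    keep = not any(flags)
    print(word, flags, 'keep' if keep else 'drop')
```

Only words whose list is all False survive, which is exactly the condition the comprehension filters on.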
For example, removing all words that contain an a gives:
>>> remove_words(in_lines, ['a'])
['this is go:od', 'is', 'is word']
(It is also possible to get rid of the for line in .. loop. At that point, though, readability really does start to suffer.)
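For reference, here is one way the loop-free variant might look. It is a sketch rather than the answerer's own code, and the function name remove_words_one_expression is hypothetical. It also passes generator expressions to join and any instead of building temporary lists, which should not change the result.

```python
def remove_words_one_expression(in_list, bad_list):
    # Outer comprehension: one output string per input line.
    # Inner generator: keep a word only if no bad phrase occurs inside it.
    return [
        ' '.join(word for word in line.split()
                 if not any(phrase in word for phrase in bad_list))
        for line in in_list
    ]

in_lines = ['this is go:od',
            'that example is bad',
            'amp is a word']

print(remove_words_one_expression(in_lines, ['amp', ':']))
# ['this is', 'that is bad', 'is a word']
```

Everything now sits in a single return statement, which is compact but arguably harder to scan than the explicit loop.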