Removing words in text files containing a character or string of letters with Python

Question

I have several lines of text and want to remove any word that contains a special character or a fixed given string (in Python).

Example:

in_lines = ['this is go:od', 
            'that example is bad', 
            'amp is a word']

# remove any word with {'amp', ':'}
out_lines = ['this is', 
             'that is bad', 
             'is a word']

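I know how to remove words that appear in a given list, but I can't work out how to remove words that contain a special character or a short sequence of letters. Let me know if anything is unclear and I will add more information.

This is what I use to remove selected words: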

def remove_stop_words(lines):
    stop_words = ['am', 'is', 'are']
    results = []
    for text in lines:
        tmp = text.split(' ')
        for stop_word in stop_words:
            for x in range(0, len(tmp)):
                if tmp[x] == stop_word:
                    tmp[x] = ''
        results.append(" ".join(tmp))
    return results

out_lines = remove_stop_words(in_lines)

Answer 1

This matches your expected output:

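def remove_stop_words(lines):
  # Substrings that mark a word for removal ('amp' and ':' in the question)
  stop_words = ['amp', ':']
  results = []
  for text in lines:
    tmp = text.split(' ')
    for x in range(0, len(tmp)):
      for st_w in stop_words:
        # 'in' tests for a substring, so 'go:od' and 'example' both match
        if st_w in tmp[x]:
          tmp[x] = ''
    # Skip the blanked entries so no stray spaces are left behind
    results.append(" ".join(word for word in tmp if word))
  return results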
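
The key difference from the code in the question is the test itself: `==` only matches a whole word, while `in` also matches when the pattern appears anywhere inside the word. A minimal sketch of that difference, using sample words from the question (the helper names `matches_exactly` and `matches_substring` are made up for illustration and are not part of the answer):

```python
def matches_exactly(word, patterns):
    # True only when the word is identical to one of the patterns
    return any(word == p for p in patterns)

def matches_substring(word, patterns):
    # True when any pattern occurs anywhere inside the word
    return any(p in word for p in patterns)

patterns = ['amp', ':']
for word in ['go:od', 'example', 'amp', 'bad']:
    # Print the word followed by the result of each test
    print(word, matches_exactly(word, patterns), matches_substring(word, patterns))
```

Only 'amp' passes the exact test, whereas 'go:od', 'example' and 'amp' all pass the substring test, which is what the question asks for.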

Answer 2

in_lines = ['this is go:od', 
            'that example is bad', 
            'amp is a word']

def remove_words(in_list, bad_list):
    out_list = []
    for line in in_list:
        words = ' '.join([word for word in line.split() if not any([phrase in word for phrase in bad_list]) ])
        out_list.append(words)
    return out_list

out_lines = remove_words(in_lines, ['amp', ':'])
print(out_lines)

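The odd-looking expression

word for word in line.split() if not any([phrase in word for phrase in bad_list])

does all the hard work at once. For a single word, it builds a list of True/False values, one for each phrase in the "bad" list. The any function then collapses this temporary list into a single True/False value; if that value is False, the word can safely be copied into the line-based output list.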
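
To make the two steps visible, here is a small sketch that evaluates the inner list and the `any` call separately for each word of one sample line. The variable names `flags` and `kept` are only for illustration and do not appear in the answer's code.

```python
bad_list = ['amp', ':']
line = 'that example is bad'

kept = []
for word in line.split():
    # One True/False entry per phrase in bad_list, for this single word
    flags = [phrase in word for phrase in bad_list]
    # any() collapses the list: True if at least one phrase was found
    if not any(flags):
        kept.append(word)
    print(word, flags, any(flags))

print(' '.join(kept))
```

For the word 'example' the inner list is [True, False], so `any` returns True and the word is dropped; the other three words produce [False, False] and are kept.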

For example, removing every word that contains an a gives the following result:

>>> remove_words(in_lines, ['a'])
['this is go:od', 'is', 'is word']

(It is also possible to get rid of the for line in .. line. At that point, though, readability really starts to suffer.)
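
For reference, a sketch of what that fully collapsed version could look like, with the per-line loop folded into an outer list comprehension. The name `remove_words_compact` is hypothetical; the behavior is intended to match `remove_words` above.

```python
def remove_words_compact(in_list, bad_list):
    # Outer comprehension walks the lines, inner generator filters the words
    return [' '.join(word for word in line.split()
                     if not any(phrase in word for phrase in bad_list))
            for line in in_list]

in_lines = ['this is go:od',
            'that example is bad',
            'amp is a word']

print(remove_words_compact(in_lines, ['amp', ':']))
```

It is shorter, but the nesting makes it harder to see at a glance which loop runs over lines and which over words, so the explicit loop is arguably the better trade-off.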
