Filter a text file for similar lines
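In a text file with many lines, I need to extract every line that starts with a similar word and is therefore not unique.
I am looking for lines that share the same beginning. They may have identical content (duplicate lines) or slightly different content after the first word. I hope an example explains it. Here is a sample of such a file: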
hungarian-american
hungarian-german lied
hungarian-german
hungarian-speaking areas
hungarian-speaking regions
hungarica
hungary
hungary and slovakia
hungary and slovakia
hungry i
hunnis, william
hunt, l.
These are the lines I am looking for:
hungarian-american
hungarian-german lied ms
hungarian-german ms
hungarian-speaking areas
hungarian-speaking regions
hungary
hungary and slovakia
hungary and slovakia
The lines discarded in this example would be:
hungarica
hungry i
hunnis, william
hunt, l.
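because they are unique (they do not start with a word that any other line starts with).
How should I approach this? I am somewhat familiar with Python and regular expressions, but maybe there is a simpler idea? Thanks for your help!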
This should do the trick:
import re
from collections import defaultdict

dic = defaultdict(list)
lines = """hungarian-american
hungarian-german lied
hungarian-german
hungarian-speaking areas
hungarian-speaking regions
hungarica
hungary
hungary and slovakia
hungary and slovakia
hungry i
hunnis, william
hunt, l.""".split('\n')

for line in lines:
    # Ideally you would use a word tokenizer such as the ones available in NLTK,
    # but this line gives the idea: the first word is everything before the
    # first comma, semicolon, hyphen, or whitespace character.
    # re.split always returns at least one element, so [0] is safe.
    first_word = re.split(r',|;|-|\s', line)[0]
    # Group lines that share the same first word
    dic[first_word].append(line)

# Show only the groups that contain more than one line
for word, lst in dic.items():
    if len(lst) > 1:
        print('\n'.join(lst))
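If the original line order matters, or the input comes from a file rather than a string literal, the same grouping idea can be wrapped in a function. The following is only a sketch under a few assumptions: the names `first_word` and `filter_similar_lines` are made up for illustration, the delimiter set is the one used in the snippet above, and blank lines are skipped.

```python
import re
from collections import Counter

# Same delimiters as in the snippet above: comma, semicolon, hyphen, whitespace.
_DELIMITERS = re.compile(r"[,;\-\s]")


def first_word(line):
    """Return the text before the first delimiter (hypothetical helper)."""
    return _DELIMITERS.split(line.strip(), maxsplit=1)[0]


def filter_similar_lines(lines):
    """Keep, in input order, the lines whose first word occurs on more than one line."""
    # Drop trailing newlines and skip blank lines (an assumption of this sketch).
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    counts = Counter(first_word(line) for line in cleaned)
    return [line for line in cleaned if counts[first_word(line)] > 1]


sample = [
    "hungarian-american",
    "hungarian-german lied",
    "hungarica",
    "hungary",
    "hungary and slovakia",
    "hungry i",
]
for kept in filter_similar_lines(sample):
    print(kept)
```

Because the function accepts any iterable of strings, an open file object can be passed to it directly. Counting first words in one pass and filtering in a second keeps the output in the same order as the input, which printing the dictionary group by group does not guarantee on older Python versions.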