Fast lexicon lookup with phrases and stemming in python

Question

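I'm building a text classifier in python, and I have a list of key phrases for each class. For example, the classes could be "travel" and "science", and the lists could contain:

travel: "New York", "South Korea", "Seoul", etc.
science: "scientist", "chemical", etc.

I'm looking for the best way to match phrases from such lists in python.

For example, the result for the document: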

A famous scientist traveled from New York to Seoul, South Korea

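should be: "science": 1, "travel": 3

Even though the "in" operator on strings is well optimized, there are a few cases that need handling:

Word boundaries: at some point I may have "to" in the dictionary, and I don't want it to match the "to" in "tomorrow". Tokenization would work in this case, but phrases need some custom logic, probably a sublist lookup in the list of tokens.
Stemming: when the list

Is there a python library that handles this efficiently? If I need to implement it from scratch, what is the best way to handle the issues above in terms of performance?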

Answer 1

In this case, a simple solution is to use a dictionary comprehension:

s = "A famous scientist traveled from New York to Seoul, South Korea"
d = {"travel":["New York", "South Korea", "Seoul"], "science": ["scientist", "chemical"]}
final_results = {a:sum(i in s for i in b) for a, b in d.items()}

Output:

{'science': 1, 'travel': 3}
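
One caveat: `in` on strings does substring matching, so a key such as "to" would also be counted inside "tomorrow". Here is a minimal sketch of the same idea with word boundaries, assuming the `\b` anchor from the `re` module is good enough for your text. The name `count_topics` is only for illustration, and this still does no stemming:

```python
import re

s = "A famous scientist traveled from New York to Seoul, South Korea"
d = {"travel": ["New York", "South Korea", "Seoul"], "science": ["scientist", "chemical"]}


def count_topics(text, lexicon):
    # Count, per topic, how many key phrases occur as whole words in the text.
    results = {}
    for topic, phrases in lexicon.items():
        results[topic] = sum(
            # re.escape guards against phrases containing regex metacharacters
            bool(re.search(r"\b" + re.escape(phrase) + r"\b", text))
            for phrase in phrases
        )
    return results


print(count_topics(s, d))
```

Like the comprehension above, this counts each key phrase at most once per document, and it runs one regex search per phrase, so it will not scale well to very large lists.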

Answer 2

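What you are trying to implement is phrase search over stems. I think this is a text-mining task, and it is the kind of thing search engines implement.

First you need tokenize and stemmer functions. Tokenization can be as simple as: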

def tokenize(string):
    # remove_punctuation is a helper you provide; keep only tokens longer than one character
    return [x for x in remove_punctuation(string).split() if len(x) > 1]

There are various stemmers on PyPI.
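
As a rough sketch of the two helpers assumed here: `remove_punctuation` can be written with `str.translate`, and for `stemmer` I'm using a naive suffix-stripping stand-in just so the example runs without dependencies. It is not a real stemming algorithm; in practice you would plug in a stemmer from PyPI instead.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text):
    return text.translate(_PUNCTUATION_TABLE)


def stemmer(word):
    # Toy stand-in (assumption): lowercase and strip a few common English suffixes.
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(remove_punctuation("Seoul, South Korea"))
print(stemmer("scientists"), stemmer("traveled"))
```

With this stand-in, "scientists" and "scientist" reduce to the same stem, which is all the matching step needs.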

You will end up with a function like this:

def preprocess(string):
    return [stemmer(word) for word in tokenize(string)]

Then the function you are looking for looks like this:

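from collections import Counter


def find(tokens, stem):
    # positions at which `stem` occurs in the token list
    return [i for i, token in enumerate(tokens) if token == stem]


def count(dictionary, phrase):
    counter = Counter()
    phrase = preprocess(phrase)
    for topic, strings in dictionary.items():
        # each topic maps to a list of key phrases
        for string in strings:
            stems = preprocess(string)
            if not stems:
                continue
            # candidate positions: wherever the first stem of the key phrase occurs
            for index in find(phrase, stems[0]):
                # the key phrase matches if the tokens from here on line up with its stems
                if phrase[index:index + len(stems)] == stems:
                    counter[topic] += 1
    return counter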

The dictionary variable maps each topic to its list of key phrases:

travel: "New York", "South Korea", "Seoul", etc.
science: "scientist", "chemical", etc.
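
If performance matters, scanning the document once per key phrase gets slow as the lists grow. One alternative, sketched here under my own assumptions, is to index every key phrase by its tuple of tokens and then slide windows of each phrase length over the document, so that each check is a single dict lookup. `build_index` and `count_fast` are hypothetical names, and the `preprocess` below is a simplified stand-in (lowercase, strip punctuation, no real stemming) for the one above.

```python
import string
from collections import Counter


def preprocess(text):
    # Simplified stand-in: lowercase, drop ASCII punctuation, split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def build_index(dictionary):
    # Map each key phrase, as a tuple of tokens, to the topics that list it.
    index = {}
    for topic, phrases in dictionary.items():
        for phrase in phrases:
            key = tuple(preprocess(phrase))
            if key:
                index.setdefault(key, set()).add(topic)
    return index


def count_fast(index, document):
    tokens = preprocess(document)
    lengths = {len(key) for key in index}  # distinct phrase lengths to try
    counter = Counter()
    for n in lengths:
        for start in range(len(tokens) - n + 1):
            # One dict lookup per window instead of one scan per phrase.
            for topic in index.get(tuple(tokens[start:start + n]), ()):
                counter[topic] += 1
    return counter


dictionary = {
    "travel": ["New York", "South Korea", "Seoul"],
    "science": ["scientist", "chemical"],
}
index = build_index(dictionary)
print(count_fast(index, "A famous scientist traveled from New York to Seoul, South Korea"))
```

Because matching works on whole tokens, "to" in the dictionary would not match "tomorrow", and the cost per document depends on the number of distinct phrase lengths rather than on the number of phrases.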
