搜索引擎的基于列表的命名实体识别：如何扩展？

Question

我正在为存储在 Solr 中的文档开发一个搜索引擎。

在用户查询中，我想检测命名实体（人、组织、城市...）。

示例查询是：

barack obama wife age

在这个查询中，我想检测 "barack obama" 是一个人。

由于查询不是真正的短语，经典 NER（Spacy、Stanford NER...）很难正常工作。所以，我采用了这种方法：

将在文档中找到的所有实体存储在字典中（按长度递减排序）

循环字典，查看用户查询是否包含实体

def find_entities(query,entities_dict):

    entities=[]
    new_query=query.lower()

    for entity in entities_dict:
        if find_substring(entity,new_query):
            entities.append({entity:entities_dict[entity]})
            new_query = re.sub(r'\b{}\b'.format(entity), '', new_query)
    return(new_query,entities)

目前，我的 Solr 索引中有大约 20 万个实体：创建字典需要几分钟；加载后，这种方法效果很好，速度快，不那么耗内存。

在不久的将来，我将拥有50-1亿个实体。

我认为将这些实体存储在内存中是不可能的。

如何改变我的方法？ 我正在寻找有关算法、 内存管理 和 数据结构 的建议。

Answer 1

一个明显的暴力解决方案就是让你的搜索索引分布式：你创建例如100 个节点，每个节点都有 100 万个实体的字典，你运行它们并行，然后合并结果。

另一种解决方案（可能是对拆分索引的补充）是让您的实体不在简单列表中，而是在 prefix tree (aka trie) or in a graph of Aho-Corasick 中。这些数据结构大大加快了子字符串搜索速度，因为它们试图在单次传递中将所有实体与您的查询匹配，利用事实许多实体中有相同的子字符串。

事实上，我已经使用 pyahocorasick 在简短的查询中查找了几百万个实体（电影、歌曲、演员等），而且它的扩展性似乎很好。形式上，Aho-Corasick 的时间复杂度不取决于实体总数，仅取决于具体查询中匹配的实体数。因此，如果搜索变慢（这不太可能），那么查看哪些实体生成大量误报匹配并将它们从索引中删除是有意义的。在我的例子中，在删除非常常见的实体（例如 "It"（这是电影名称！）之后，匹配器的速度会更快。

这是一个例子。首先，我们获取要搜索的实体（15K 个城市）：

pip install pyahocorasick
wget https://simplemaps.com/static/data/world-cities/basic/simplemaps_worldcities_basicv1.6.zip
unzip simplemaps_worldcities_basicv1.6.zip

然后我们创建可以匹配实体（城市）的自动机：

import pandas as pd
import re
import ahocorasick
cities = pd.read_csv('worldcities.csv')

def preprocess(text):
    """
    Add underscores instead of non-alphanumeric characters 
    and on the word boundaries in order to tell words from word substrings.
    """
    return '_{}_'.format(re.sub('[^a-z0-9]', '_', text.lower()))

index = ahocorasick.Automaton()
for city in cities.city:
    index.add_word(preprocess(city), city)
index.make_automaton()
# this object can be pickled to disk and then loaded back

现在实际将此索引应用于文本中的查找实体：

def find_cities(text, searcher):
    result = dict()
    for end_index, city_name in searcher.iter(preprocess(text)):
        end = end_index - 1
        start = end - len(city_name)
        result[(start, end)] = city_name
    return result

print(find_cities( 'Tver’ is somewhere between Moscow and Saint Petersburg', index))
# {(0, 5): 'Tver’', (27, 33): 'Moscow', (38, 54): 'Saint Petersburg', (44, 54): 'Petersburg'} 
# the search takes about 0.1 ms

简单搜索给出相同的结果，但需要大约 10 毫秒：

for city in cities.city:
    idx = text.find(city)
    if idx >=0:
        print(idx, city)

这是 the notebook 和我的代码。

搜索引擎的基于列表的命名实体识别：如何扩展？

List-based Named Entity Recognition for search engine: how to scale?

memory

search

solr

nlp

named-entity-recognition