Differentiate between countries and cities in spaCy NER

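I am trying to extract countries from organization addresses using spaCy NER. However, it tags countries and cities with the same label, GPE. Is there any way to differentiate them?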

For example:

import en_core_web_sm

nlp = en_core_web_sm.load()

doc= nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')

for ent in doc.ents:
    if ent.label_ == 'GPE':
        print(ent.text)

which returns:

Tempe
AZ
United States
United States
Tempe
AZ
United States
Tempe
AZ
United States

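As you mentioned, the GPE entity covers countries, cities and states, so you will not be able to detect only country entities with the given model.

I would suggest simply creating a list of countries and then checking whether the GPE entity is in this list.

import en_core_web_sm

nlp = en_core_web_sm.load()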

doc= nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')

# create a list of country names that possibly appear in the text
countries = ['US', 'USA', 'United States']

for ent in doc.ents:
    if ent.label_ == 'GPE':
        # check if the value is in the list of countries
        if ent.text in countries:
            print(ent.text, '-- Country')
        else:
            print(ent.text, '-- City or State')

This would output the following:

Tempe -- City or State
United States -- Country
Monterey -- City or State
United States -- Country
Tempe -- City or State
United States -- Country
United States -- Country
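
A plain list lookup is case-sensitive and misses variants such as "U.S.A." or "united states". Below is a minimal sketch of a more forgiving check, assuming a hand-maintained alias set. The `normalize` and `is_country` helpers and the alias entries are illustrative and not part of spaCy:

```python
# Hypothetical helpers for illustration; the alias set is an assumption
# and would need to be extended for real data.

def normalize(name):
    """Lowercase the name, drop periods and collapse whitespace."""
    return " ".join(name.replace(".", "").casefold().split())

COUNTRY_ALIASES = {
    normalize(n)
    for n in ["US", "USA", "United States", "United States of America"]
}

def is_country(entity_text):
    """Check the normalized entity text against the alias set."""
    return normalize(entity_text) in COUNTRY_ALIASES

for text in ["United States", "U.S.A.", "united  states", "Tempe"]:
    print(text, "->", "Country" if is_country(text) else "City or State")
```

In the loop over `doc.ents`, `is_country(ent.text)` could stand in for `ent.text in countries`.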

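As mentioned in the other answer, the GPE label of the pretrained spaCy models covers countries, cities and states. However, there is a workaround, and I believe several approaches can be used.

One approach: you can add a custom label to the model. There is a good article on Towards Data Science that could help you do that. Gathering training data for this could be a hassle, as you would need to tag cities and countries according to their respective position in the sentence. To quote another answer: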

Spacy NER model training includes the extraction of other "implicit" features, such as POS and surrounding words.

When you try to train on single words, it is unable to get generalized enough features to detect those entities.
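
The tagging effort mentioned above amounts to recording character offsets for each entity inside its full sentence. Here is a rough sketch of preparing such annotations in the `(text, {"entities": [(start, end, label)]})` layout that spaCy's training utilities accept. The `CITY` and `COUNTRY` label names and the `annotate` helper are assumptions made for illustration, and the naive `str.find` matching only tags the first occurrence of each substring:

```python
def annotate(text, spans):
    """Build one training record from (substring, label) pairs."""
    entities = []
    for substring, label in spans:
        start = text.find(substring)
        if start == -1:
            raise ValueError(f"{substring!r} not found in text")
        entities.append((start, start + len(substring), label))
    return (text, {"entities": sorted(entities)})

record = annotate(
    "Naval Postgraduate School, Monterey, CA, United States",
    [("Monterey", "CITY"), ("United States", "COUNTRY")],
)

text, annotations = record
for start, end, label in annotations["entities"]:
    # Each offset pair slices the labeled span back out of the sentence
    print(text[start:end], label)
```

Annotating whole addresses rather than isolated names keeps the surrounding words available as context, which is the point of the quoted answer.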

A simpler workaround is the following:

Install geonamescache:

pip install geonamescache

Then use the following code to get the lists of countries and cities:

import geonamescache

gc = geonamescache.GeonamesCache()

# gets nested dictionary for countries
countries = gc.get_countries()

# gets nested dictionary for cities
cities = gc.get_cities()

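The documentation notes that many other kinds of location data are available as well.

Use the following function to get all values of a key with a given name from a nested dictionary (taken from this answer):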

def gen_dict_extract(var, key):
    if isinstance(var, dict):
        for k, v in var.items():
            if k == key:
                yield v
            if isinstance(v, (dict, list)):
                yield from gen_dict_extract(v, key)
    elif isinstance(var, list):
        for d in var:
            yield from gen_dict_extract(d, key)

Load the cities and countries lists, respectively:

cities = [*gen_dict_extract(cities, 'name')]
countries = [*gen_dict_extract(countries, 'name')]
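
If each top-level value in these dictionaries is a record with a `name` field, as the `gen_dict_extract(..., 'name')` calls above imply, a set comprehension is a possible shortcut, and sets make the later `in` checks constant-time. The sketch below uses small hand-written stand-ins for the two dictionaries. The sample records and their ids are made up; with the library installed, `gc.get_countries()` and `gc.get_cities()` would take their place:

```python
# Stand-ins for gc.get_countries() / gc.get_cities(); the records are
# abbreviated and only assumed to mirror the real layout.
sample_countries = {
    "US": {"name": "United States", "iso": "US"},
    "FR": {"name": "France", "iso": "FR"},
}
sample_cities = {
    "1": {"name": "Tempe", "countrycode": "US"},
    "2": {"name": "Monterey", "countrycode": "US"},
}

# Collect the 'name' field of every record into a set
country_names = {record["name"] for record in sample_countries.values()}
city_names = {record["name"] for record in sample_cities.values()}

print("United States" in country_names)
print("Tempe" in city_names)
```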

Then use the following code to differentiate them:

import spacy

nlp = spacy.load("en_core_web_sm")

doc= nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')

for ent in doc.ents:
    if ent.label_ == 'GPE':
        if ent.text in countries:
            print(f"Country : {ent.text}")
        elif ent.text in cities:
            print(f"City : {ent.text}")
        else:
            print(f"Other GPE : {ent.text}")

Output:

City : Tempe
Other GPE : AZ
Country : United States
Country : United States
City : Tempe
Other GPE : AZ
Country : United States
City : Tempe
Other GPE : AZ
Country : United States
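
The `if`/`elif` chain reports only the first match, so a name present in more than one list (for example "Georgia", both a country and a US state) is silently treated as a country. One possible variation returns every matching category and also recognizes state codes such as AZ. The sample sets below are tiny stand-ins and the `gpe_categories` helper is hypothetical:

```python
def gpe_categories(text, countries, states, cities):
    """Return all categories whose name set contains the entity text."""
    lookups = (("Country", countries), ("State", states), ("City", cities))
    matches = [label for label, names in lookups if text in names]
    return matches or ["Other GPE"]

# Illustrative sample data only; real sets would come from a gazetteer
countries = {"United States", "Georgia"}
states = {"AZ", "CA", "Georgia"}
cities = {"Tempe", "Monterey"}

for text in ["Tempe", "AZ", "United States", "Georgia", "Atlantis"]:
    print(text, gpe_categories(text, countries, states, cities))
```

Returning a list leaves the caller free to resolve ambiguous names using other parts of the address.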