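Differentiate between countries and cities in spaCy NER
I am trying to extract countries from organisation addresses using spaCy NER. However, it tags both countries and cities with the same label, GPE. Is there any way to differentiate them?
For example: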
import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')
# print every entity that the model labels as GPE
for ent in doc.ents:
    if ent.label_ == 'GPE':
        print(ent.text)
This returns:
Tempe
AZ
United States
United States
Tempe
AZ
United States
Tempe
AZ
United States
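As stated, the GPE entity covers "Countries, cities and states", so you won't be able to detect only country entities with the given model.
I would suggest simply creating a list of countries and then checking whether each GPE entity is in that list.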
import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')
# create a list of country names that possibly appear in the text
countries = ['US', 'USA', 'United States']
for ent in doc.ents:
    if ent.label_ == 'GPE':
        # check if the value is in the list of countries
        if ent.text in countries:
            print(ent.text, '-- Country')
        else:
            print(ent.text, '-- City or State')
This outputs the following:
Tempe -- City or State
United States -- Country
Monterey -- City or State
United States -- Country
Tempe -- City or State
United States -- Country
United States -- Country
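An exact-match list like the one above will miss spelling variants such as "U.S.A." or "united states". The following is a minimal sketch of normalising the entity text before the lookup. The alias table and the helpers `normalise` and `classify_gpe` are hypothetical names introduced here for illustration; they are not part of spaCy.

```python
# Hypothetical alias table: extend it with the spellings that occur in your data.
COUNTRY_ALIASES = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
}


def normalise(text):
    """Drop dots, case-fold, and collapse whitespace so 'U.S.A.' matches 'usa'."""
    return " ".join(text.replace(".", "").casefold().split())


def classify_gpe(text):
    """Return (label, canonical_name) for the text of one GPE entity."""
    canonical = COUNTRY_ALIASES.get(normalise(text))
    if canonical is not None:
        return "Country", canonical
    return "City or State", text.strip()


for sample in ["United States", "U.S.A.", "Tempe", "AZ"]:
    print(sample, "--", classify_gpe(sample))
```

In the loop over `doc.ents`, `classify_gpe(ent.text)` would take the place of the `ent.text in countries` check.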
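As mentioned in the other answers, the GPE label of the pre-trained spaCy models covers countries, cities and states. However, there is a workaround, and I'm sure several approaches can be used.
One approach: you could add a custom label to the model. There is a good article on Towards Data Science that could help you do that. Gathering training data for this could be a hassle, as you would need to tag cities and countries at their respective positions in each sentence. To quote another answer:
"Spacy NER model training includes the extraction of other "implicit" features, such as POS and surrounding words.
When you attempt to train on single words, it is unable to get generalized enough features to detect those entities."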
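To make "tagging per location in the sentence" concrete, here is a rough sketch that builds one annotated example in the `(start, end, label)` character-offset layout commonly used for spaCy NER training data. The labels `COUNTRY` and `CITY`, the helper `annotate`, and the tiny gazetteer are all assumptions made for illustration. Note that the whole sentence is kept, so the model sees the surrounding words rather than isolated names.

```python
import re


def annotate(text, gazetteer):
    """Return (text, {"entities": [(start, end, label), ...]}) for known names.

    `gazetteer` maps a label to the names that should receive it.
    """
    spans = []
    for label, names in gazetteer.items():
        for name in names:
            # Match whole words only, so "Tempe" does not match inside "Tempest".
            for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                spans.append((match.start(), match.end(), label))
    spans.sort()
    return text, {"entities": spans}


# Hypothetical labels and names, chosen to match the example address.
gazetteer = {"COUNTRY": ["United States"], "CITY": ["Tempe"]}
text, annotations = annotate(
    "Arizona State University, Tempe, AZ, United States", gazetteer
)
for start, end, label in annotations["entities"]:
    print(label, repr(text[start:end]), start, end)
```

A real training set would need many such sentences, and overlapping matches would have to be resolved before training.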
An easier workaround is the following. Install geonamescache:
pip install geonamescache
Then use the following code to get the countries and cities:
import geonamescache
gc = geonamescache.GeonamesCache()
# gets nested dictionary for countries
countries = gc.get_countries()
# gets nested dictionary for cities
cities = gc.get_cities()
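The documentation states that you can get a host of other location options as well.
Use the following function to get all the values stored under a key with a given name in a nested dictionary (taken from another answer):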
def gen_dict_extract(var, key):
    if isinstance(var, dict):
        for k, v in var.items():
            if k == key:
                yield v
            if isinstance(v, (dict, list)):
                yield from gen_dict_extract(v, key)
    elif isinstance(var, list):
        for d in var:
            yield from gen_dict_extract(d, key)
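To see what the generator yields, it can be run on a small hand-made dictionary. The sample data below is invented for illustration and is not real library output; the function is repeated so that the snippet runs on its own.

```python
def gen_dict_extract(var, key):
    """Yield every value stored under `key`, at any depth of nested dicts and lists."""
    if isinstance(var, dict):
        for k, v in var.items():
            if k == key:
                yield v
            if isinstance(v, (dict, list)):
                yield from gen_dict_extract(v, key)
    elif isinstance(var, list):
        for item in var:
            yield from gen_dict_extract(item, key)


# Hand-made sample with one level of nesting and one nested list.
sample = {
    "US": {"name": "United States", "capital": "Washington"},
    "FR": {"name": "France", "neighbours": [{"name": "Spain"}, {"name": "Italy"}]},
}

names = list(gen_dict_extract(sample, "name"))
print(names)  # ['United States', 'France', 'Spain', 'Italy']
```

Because the search is recursive, every "name" key at any depth is collected, which is worth keeping in mind if a record nests other named objects.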
Load the two lists, cities and countries, respectively:
cities = [*gen_dict_extract(cities, 'name')]
countries = [*gen_dict_extract(countries, 'name')]
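Since the names are only used for membership tests, a set may be a better fit than a list: `in` on a list scans every element, while a set lookup is a hash lookup. A minimal sketch follows, using stand-in data and assuming each top-level value is a flat record with a "name" key. Both the data and the helper `names_of` are illustrative assumptions.

```python
# Stand-in data shaped like the nested dictionaries described above
# (an assumption for illustration; real records carry more fields).
countries_raw = {
    "US": {"name": "United States", "iso": "US"},
    "FR": {"name": "France", "iso": "FR"},
}
cities_raw = {
    "1": {"name": "Tempe", "countrycode": "US"},
    "2": {"name": "Monterey", "countrycode": "US"},
    "3": {"name": "Paris", "countrycode": "FR"},
}


def names_of(records):
    """Collect the 'name' field of every record into a set."""
    return {record["name"] for record in records.values()}


country_names = names_of(countries_raw)
city_names = names_of(cities_raw)

print("United States" in country_names)
print("Tempe" in city_names)
print("AZ" in city_names)
```

The same conversion works on the extracted lists directly, for example `set(cities)`.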
Then use the following code to differentiate:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')
for ent in doc.ents:
    if ent.label_ == 'GPE':
        # countries are checked first, then cities; anything else stays a generic GPE
        if ent.text in countries:
            print(f"Country : {ent.text}")
        elif ent.text in cities:
            print(f"City : {ent.text}")
        else:
            print(f"Other GPE : {ent.text}")
Output:
City : Tempe
Other GPE : AZ
Country : United States
Country : United States
City : Tempe
Other GPE : AZ
Country : United States
City : Tempe
Other GPE : AZ
Country : United States
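In the output above, "AZ" falls through to "Other GPE" because it is neither a country nor a city name. One possible extension is a third lookup for state abbreviations. The sketch below uses tiny sample tables and a hypothetical helper `classify_place`. It also shows that the order of the checks decides the label when a name is both a country and a city, as with Singapore or Luxembourg.

```python
# Small sample tables for illustration only; real ones would come from a gazetteer.
COUNTRIES = {"United States", "Singapore", "Luxembourg"}
CITIES = {"Tempe", "Monterey", "Singapore", "Luxembourg"}
US_STATE_CODES = {"AZ": "Arizona", "CA": "California"}


def classify_place(text):
    """Label one GPE string; earlier checks win when a name is ambiguous."""
    name = text.strip()
    if name in COUNTRIES:
        return "Country"
    if name in CITIES:
        return "City"
    if name in US_STATE_CODES:
        return "State"
    return "Other GPE"


for place in ["Tempe", "AZ", "United States", "Singapore", "Atlantis"]:
    print(f"{classify_place(place)} : {place}")
```

If city labels matter more than country labels for your data, swapping the first two checks changes how the ambiguous names are resolved.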