Removing named entities from a document using spacy
I am trying to remove the words that spacy recognizes as named entities from a document, so essentially removing "Sweden" and "Nokia" from the example string. I can't find a way around the problem that the entities are stored as spans, so comparing them against the individual tokens in the spacy document raises an error.
In a later step, this process is supposed to become a function applied to several text documents stored in a pandas dataframe.
I would also appreciate any help and advice on how to better post a question, as this is my first one here.
import spacy

nlp = spacy.load('en')
text_data = u'This is a text document that speaks about entities like Sweden and Nokia'
document = nlp(text_data)

text_no_namedentities = []
for word in document:
    if word not in document.ents:  # compares a Token against Spans
        text_no_namedentities.append(word)
return " ".join(text_no_namedentities)
It produces the following error:
TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got spacy.tokens.span.Span)
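The comparison fails because document.ents holds Span objects while iterating a Doc yields Token objects. One way around this, sketched here using token indices rather than text matching (tokens_outside_entities is a hypothetical helper name), is:

```python
def tokens_outside_entities(doc):
    """Return the tokens of `doc` that are not part of any named entity.

    `doc.ents` holds Span objects while iterating a Doc yields Tokens,
    so `token in doc.ents` raises the TypeError above; comparing each
    token's index against the spans' start/end boundaries avoids it.
    """
    ent_ranges = [(ent.start, ent.end) for ent in doc.ents]
    return [t for t in doc
            if not any(start <= t.i < end for start, end in ent_ranges)]

# Usage sketch (assumes spaCy and a model such as en_core_web_sm are installed):
#   nlp = spacy.load('en_core_web_sm')
#   doc = nlp('This is a text document that speaks about entities like Sweden and Nokia')
#   print(" ".join(t.text for t in tokens_outside_entities(doc)))
```

Unlike matching on token text, this also keeps unrelated tokens that merely share their text with an entity.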
This will give you the desired result. Looking into Named Entity Recognition should help you going forward.
import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'This is a text document that speaks about entities like Sweden and Nokia'
document = nlp(text_data)

text_no_namedentities = []
ents = [e.text for e in document.ents]
for item in document:
    if item.text not in ents:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))
Output:
This is a text document that speaks about entities like and
This doesn't handle entities that span multiple tokens.
import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)

text_no_namedentities = []
ents = [e.text for e in document.ents]
for item in document:
    if item.text not in ents:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))
Output
'New York is in'
Here USA is removed correctly, but New York is not: its tokens 'New' and 'York' never match the entity text 'New York'.
Solution
import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)
# A token's ent_type_ is non-empty whenever it belongs to an entity,
# even one spanning several tokens.
print(" ".join([token.text for token in document if not token.ent_type_]))
Output
'is in'
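Since the question mentions applying this over several documents stored in a pandas dataframe, the same ent_type_ filter can be wrapped in a reusable function; a minimal sketch (make_entity_remover is a hypothetical name, and the pandas usage assumes a column called 'text'):

```python
def make_entity_remover(nlp):
    """Build a function that strips named-entity tokens from a text,
    given any spaCy pipeline `nlp` whose tokens carry ent_type_."""
    def remove(text):
        return " ".join(t.text for t in nlp(text) if not t.ent_type_)
    return remove

# Usage sketch (assumes spaCy, pandas, and the en_core_web_sm model are installed):
#   import spacy, pandas as pd
#   remover = make_entity_remover(spacy.load('en_core_web_sm'))
#   df = pd.DataFrame({'text': ['New York is in USA']})
#   df['text_no_ents'] = df['text'].apply(remover)
```

Loading the pipeline once and closing over it avoids re-loading the model for every row of the dataframe.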
You can use the entity attributes start_char and end_char to replace each entity with an empty string.
import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)

ents = [(e.start_char, e.end_char) for e in document.ents]
# Remove entities from the end of the string first, so the earlier
# offsets remain valid after each removal.
for start_char, end_char in reversed(ents):
    text_data = text_data[:start_char] + text_data[end_char:]
print(text_data)
I had problems with the solutions above:
kochar96's and APhillips's solutions modify the text, since spacy's tokenization turns can't --> ca n't after the join.
I didn't quite follow Batmobil's solution, but followed the general idea of using the start and end indices.
Explanation of the hack-y numpy solution in the printout below. (No time to do something more reasonable; edits and improvements welcome.)
import spacy
import numpy as np

nlp = spacy.load('en_core_web_sm')
text_data = "This can't be a text document that speaks about entities like Sweden and Nokia"
my_ents = [(e.start_char, e.end_char) for e in nlp(text_data).ents]
my_str = text_data
print(f'{my_ents=}')

# Pair up the character ranges to keep: from 0 to the first entity's
# start, between consecutive entities, and from the last entity's end
# onward (the trailing -1 drops the final character, which is harmless
# here only because the text ends with an entity).
idx_keep = [0] + np.array(my_ents).ravel().tolist() + [-1]
idx_keep = np.array(idx_keep).reshape(-1, 2)
print(idx_keep)

keep_text = ''
for start_char, end_char in idx_keep:
    keep_text += my_str[start_char:end_char]
print(keep_text)
my_ents=[(62, 68), (73, 78)]
[[ 0 62]
[68 73]
[78 -1]]
This can't be a text document that speaks about entities like and
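Since the answer above invites improvements: the same start/end bookkeeping can be done without numpy by slicing between consecutive spans, which keeps the original whitespace intact and handles text after the last entity correctly (a sketch; remove_spans is a hypothetical name, and the offsets used below are the ones printed above):

```python
def remove_spans(text, spans):
    """Remove (start_char, end_char) spans from text, keeping
    everything between and around them exactly as it appears."""
    kept, last = [], 0
    for start, end in sorted(spans):
        kept.append(text[last:start])
        last = end
    kept.append(text[last:])
    return ''.join(kept)

text_data = "This can't be a text document that speaks about entities like Sweden and Nokia"
# Offsets for Sweden and Nokia, as printed by the snippet above; in
# practice they would come from
# [(e.start_char, e.end_char) for e in nlp(text_data).ents]
print(remove_spans(text_data, [(62, 68), (73, 78)]))
```

Because no index arithmetic is redone after each removal, the spans can be applied in any order, and a text that does not end with an entity keeps its final characters.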