How to use spaCy to do Named Entity Recognition on a CSV file
I have tried many ways to do named entity recognition on a column of my CSV file. I tried ne_chunk, but I can't get ne_chunk's results into columns like this:
ID  STORY                                  PERSON  NE  NP  NN  VB  GE
1   Washington, a police officer James...  1       0   0   0   0   1
After running this code,
import nltk
import pandas as pd
from collections import Counter

news = pd.read_csv("news.csv")
news['tokenize'] = news.apply(lambda row: nltk.word_tokenize(row['STORY']), axis=1)
news['pos_tags'] = news.apply(lambda row: nltk.pos_tag(row['tokenize']), axis=1)
news['entityrecog'] = news.apply(lambda row: nltk.ne_chunk(row['pos_tags']), axis=1)
tag_count_df = pd.DataFrame(news['entityrecognition'].map(lambda x: Counter(tag[1] for tag in x)).to_list())
news = pd.concat([news, tag_count_df], axis=1).fillna(0).drop(['entityrecognition'], axis=1)
news.to_csv("news.csv")
I got this error:
IndexError: list index out of range
So I'm wondering whether I could do this with spaCy instead, which is another thing I don't know much about. Can someone help?
It seems like you aren't checking the chunks correctly, which is why you get the key error (note, too, that your code writes a column named entityrecog but then reads entityrecognition). I'm guessing at what you want to do, but the following creates a new column for each NER type that NLTK returns. It would be a bit cleaner to predefine a column for each NER type and zero it out, since otherwise you get NaN wherever a type is absent.
import nltk
import pandas as pd
from collections import Counter

def extract_ner_count(tagged):
    # Map each entity string to its NER label, then count the labels.
    entities = {}
    chunks = nltk.ne_chunk(tagged)
    for chunk in chunks:
        if type(chunk) is nltk.Tree:
            # If you don't need the entity strings, just count the label
            # directly rather than building this dict.
            t = ''.join(c[0] for c in chunk.leaves())
            entities[t] = chunk.label()
    return Counter(entities.values())

news = pd.read_csv("news.csv")
news['tokenize'] = news.apply(lambda row: nltk.word_tokenize(row['STORY']), axis=1)
news['pos_tags'] = news.apply(lambda row: nltk.pos_tag(row['tokenize']), axis=1)
news['entityrecognition'] = news.apply(lambda row: extract_ner_count(row['pos_tags']), axis=1)
news = pd.concat([news, pd.DataFrame(list(news["entityrecognition"]))], axis=1)
print(news.head())
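To see why predefining the columns matters, here is a tiny standalone illustration (with made-up counts, not taken from the data above) of how a column of Counters becomes per-label columns and where the NaNs come from:

import pandas as pd
from collections import Counter

# Made-up counts, just to show the mechanics: pandas aligns the Counter
# keys into columns, and labels missing from a row come back as NaN.
rows = [Counter({'PERSON': 2, 'GPE': 1}), Counter({'ORGANIZATION': 1})]
print(pd.DataFrame(rows))
#    PERSON  GPE  ORGANIZATION
# 0     2.0  1.0           NaN
# 1     NaN  NaN           1.0
print(pd.DataFrame(rows).fillna(0).astype(int))  # zero out the gaps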
If you only need the counts, the following is more performant and leaves no NaNs:
import nltk
import pandas as pd
from collections import Counter

tagger = nltk.PerceptronTagger()
chunker = nltk.data.load(nltk.chunk._MULTICLASS_NE_CHUNKER)
NE_Types = {'GPE', 'ORGANIZATION', 'LOCATION', 'GSP', 'O', 'FACILITY', 'PERSON'}

def extract_ner_count(text):
    # Count the NER labels found in one raw story string.
    c = Counter()
    chunks = chunker.parse(tagger.tag(nltk.word_tokenize(text, preserve_line=True)))
    for chunk in chunks:
        if type(chunk) is nltk.Tree:
            c.update([chunk.label()])
    return c

news = pd.read_csv("news.csv")
for NE_Type in NE_Types:
    news[NE_Type] = 0
news.update(list(news["STORY"].apply(extract_ner_count)))
print(news.head())
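Since the question specifically asked about spaCy, here is a minimal sketch of the same counting idea with spaCy instead of NLTK. It assumes the small English model is installed (python -m spacy download en_core_web_sm); the file name news.csv and the STORY column are taken from the question above.

import pandas as pd
import spacy
from collections import Counter

# Assumes the model was installed with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def spacy_ner_count(text):
    # Count spaCy entity labels (PERSON, GPE, ORG, ...) in one story.
    return Counter(ent.label_ for ent in nlp(text).ents)

news = pd.read_csv("news.csv")
counts = pd.DataFrame(news["STORY"].apply(spacy_ner_count).tolist()).fillna(0).astype(int)
news = pd.concat([news, counts], axis=1)
print(news.head())

For larger files, running the column through nlp.pipe(news["STORY"]) instead of calling nlp() row by row should be noticeably faster, since spaCy batches the documents.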