How to do named entity recognition (NER) using quanteda?
I have a data frame with text:
df = data.frame(id=c(1,2), text = c("My best friend John works and Google", "However he would like to work at Amazon as he likes to use python and stay at Canada"))
Without any preprocessing, how can I extract named entities like this?
Example of the desired result:
dfresults = data.frame(id=c(1,2), ner_words = c("John, Google", "Amazon, python, Canada"))
You can do this without quanteda, using the spacyr package, a wrapper around the spaCy library mentioned in your linked article.
Here, I have slightly edited your input data.frame.
df <- data.frame(id = c(1, 2),
                 text = c("My best friend John works at Google.",
                          "However he would like to work at Amazon as he likes to use Python and stay in Canada."),
                 stringsAsFactors = FALSE)
Then:
library("spacyr")
library("dplyr")
# -- one-time setup, needed before spacy_initialize() will work:
# spacy_install()
# spacy_download_langmodel(model = "en_core_web_lg")
spacy_initialize(model = "en_core_web_lg")
#> Found 'spacy_condaenv'. spacyr will use this environment
#> successfully initialized (spaCy Version: 2.0.10, language model: en_core_web_lg)
#> (python options: type = "condaenv", value = "spacy_condaenv")
txt <- df$text
names(txt) <- df$id
spacy_parse(txt, lemma = FALSE, entity = TRUE) %>%
    entity_extract() %>%
    group_by(doc_id) %>%
    summarize(ner_words = paste(entity, collapse = ", "))
#> # A tibble: 2 x 2
#>   doc_id ner_words
#>   <chr>  <chr>
#> 1 1      John, Google
#> 2 2      Amazon, Python, Canada
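The result above is keyed by doc_id (a character column) rather than by the numeric id of the original data frame. As a rough sketch of how one might get back to the dfresults shape from the question, the following uses base R only. The `ents` data frame is a hand-typed stand-in for what entity_extract() returned above (an assumption, so the example runs without spaCy installed), and the names `ents`, `collapsed`, and `ner` are purely illustrative.

```r
# Input data frame, as edited in the answer.
df <- data.frame(id = c(1, 2),
                 text = c("My best friend John works at Google.",
                          "However he would like to work at Amazon as he likes to use Python and stay in Canada."),
                 stringsAsFactors = FALSE)

# ASSUMPTION: stand-in for the output of entity_extract(); only the
# doc_id and entity columns are reproduced, with the values shown above.
ents <- data.frame(doc_id = c("1", "1", "2", "2", "2"),
                   entity = c("John", "Google", "Amazon", "Python", "Canada"),
                   stringsAsFactors = FALSE)

# Collapse the entities of each document into one comma-separated string.
collapsed <- tapply(ents$entity, ents$doc_id, paste, collapse = ", ")

# doc_id is character, so convert it before matching the numeric id.
ner <- data.frame(id = as.numeric(names(collapsed)),
                  ner_words = as.vector(collapsed),
                  stringsAsFactors = FALSE)

# all.x = TRUE keeps documents in which no entity was found (NA in ner_words).
dfresults <- merge(df[, "id", drop = FALSE], ner, by = "id", all.x = TRUE)
print(dfresults)
```

With real data, `ents` would be replaced by the actual result of `spacy_parse(...) %>% entity_extract()`. The left join matters because a document with no recognized entities would otherwise silently drop out of the result.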