How to feed a tibble to spacyr?

Question

Consider this simple example:

bogustib <- tibble(doc_id = c(1,2,3),
                   text = c('bug', 'one love', '838383838'))

# A tibble: 3 x 2
  doc_id text     
   <dbl> <chr>    
1      1 bug      
2      2 one love 
3      3 838383838

The tibble is called bogustib because I know spacyr will fail on the third row:

> spacy_parse('838383838', lemma = FALSE, entity = TRUE, nounphrase = TRUE)
Error in `$<-.data.frame`(`*tmp*`, "doc_id", value = "text1") : 
  replacement has 1 row, data has 0

So, naturally, feeding the whole tibble to spacyr fails as well:

spacy_parse(bogustib, lemma = FALSE, entity = TRUE, nounphrase = TRUE)
Error in `$<-.data.frame`(`*tmp*`, "doc_id", value = "3") : 
  replacement has 1 row, data has 0

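My question: I think I could avoid the problem by calling spacy_parse row by row.

However, that seems inefficient, and I would like to use spacyr's multithread argument to speed up the computation on my large tibble.

Is there a workaround? Thanks!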

Answer 1

Actually, this does not happen in my environment. There, the output is as follows:

library(tidyverse)
library(spacyr)

bogustib <- tibble(doc_id = c(1,2,3),
                   text = c('bug', 'one love', '838383838'))

spacy_parse(bogustib)

spacy_parse('838383838', lemma = FALSE, entity = TRUE, nounphrase = TRUE)
## No noun phrase found in documents.
##   doc_id sentence_id token_id     token pos     entity
## 1  text1           1        1 838383838 NUM CARDINAL_B

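To get this result I used the latest master branch on GitHub. When I ran the same code with the CRAN version of spacyr, however, I was able to reproduce your error. I am sure I fixed this bug a while ago, but the fix does not seem to have made it into the CRAN version yet. We will try to update CRAN as soon as possible.

In the meantime, you can run: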

devtools::install_github('quanteda/spacyr')

Or download the repo as a zip file and run:

devtools::install('******')

where ****** is the path to the unzipped repository.
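
If updating is not an option, the row-by-row fallback mentioned in the question can be sketched as below. This is only a minimal sketch, not part of spacyr: parse_each_safely is a hypothetical helper name, and it assumes that a failing document raises an ordinary R error that tryCatch can intercept. It also gives up whatever the multithread argument would gain on a single large call, and each document parsed on its own will presumably get the default doc_id (text1, as in the output above), so the original ids have to be restored separately.

```r
# Hypothetical helper (not a spacyr function): apply `parser` to each text
# on its own and record which documents raised an error.
parse_each_safely <- function(texts, parser, ...) {
  results <- lapply(seq_along(texts), function(i) {
    # A failing document yields NULL instead of aborting the whole run.
    tryCatch(parser(texts[[i]], ...), error = function(e) NULL)
  })
  failed <- vapply(results, is.null, logical(1))
  list(parsed = results[!failed], failed_index = which(failed))
}

# Assumed usage with spacyr (needs a working spaCy installation):
# out <- parse_each_safely(bogustib$text, spacyr::spacy_parse,
#                          lemma = FALSE, entity = TRUE, nounphrase = TRUE)
# out$failed_index  # positions of the texts that could not be parsed
```

Passing the parser in as an argument keeps the helper independent of spacyr, so it can be tried out with any function that takes a single text.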
