What is the code to know which words are different between two dfm?

Question

I have two dfms and want to find out which words are missing from one or differ between them. For example:

library(quanteda)

df1 <- data.frame(Text = c("Whosebug is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD. So Stack is immensely useful. Thank you guys to sort this out for me."), stringsAsFactors = FALSE)

corpus1 <- corpus(df1, text_field = "Text")

df2 <- data.frame(Text = c("Whosebug is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD."), stringsAsFactors = FALSE)
corpus2 <- corpus(df2, text_field = "Text")

dfm1 <- dfm(corpus1, remove_punct = TRUE)

dfm2 <- dfm(corpus2, remove_punct = TRUE)

I want to see which words in dfm2 do not appear in dfm1. Many thanks for your help!

Answer 1

This seems to do it:

# Split each text on whitespace (the backslash must be doubled in an R string)
corpus1 <- unlist(strsplit(df1$Text, "\\s"))
corpus2 <- unlist(strsplit(df2$Text, "\\s"))

Remove punctuation:

corpus1 <- gsub("[.;!?,]", "", corpus1)
corpus2 <- gsub("[.;!?,]", "", corpus2)

Get the words that are in corpus1 but not in corpus2 (note that this comparison is case-sensitive):

corpus1[!corpus1 %in% corpus2]
 [1] "So"        "Stack"     "immensely" "useful"    "Thank"     "guys"      "sort"      "this"     
 [9] "out"       "for"

Answer 2

The answer above works well. However, I think it can be done more cleanly with dfm_select:

dfm_select(dfm1, pattern = dfm2, selection = "remove")
#> Document-feature matrix of: 1 document, 10 features (0.0% sparse).
#> 1 x 10 sparse Matrix of class "dfm"
#>        features
#> docs    so stack immensely useful thank guys sort this out for
#>   text1  1     1         1      1     1    1    1    1   1   1

Answer 3

Base R one-liner:

unlist(strsplit(df1$Text, "\\s+"))[!(unlist(strsplit(gsub("[[:punct:]]",
                                                          "",
                                                          tolower(df1$Text)),
                           "\\s+")) %in%
        (unlist(strsplit(gsub("[[:punct:]]", "", tolower(df2$Text)),
                         "\\s+"))))]

Data used:

df1 <- data.frame(Text = c("Whosebug is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD. So Stack is immensely useful. Thank you guys to sort this out for me."), stringsAsFactors = FALSE)

df2 <- data.frame(Text = c("Whosebug is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD."), stringsAsFactors = FALSE)
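
For readability, the same base R idea can be broken into steps. This is only a sketch: normalize_words() is a hypothetical helper name that does not come from the answer above, and it compares lower-cased words, so the result is lower case rather than in the original casing.

```r
# Hypothetical helper (not part of the original answer):
# lower-case the text, strip punctuation, then split on runs of whitespace.
normalize_words <- function(text) {
  cleaned <- gsub("[[:punct:]]", "", tolower(text))
  unlist(strsplit(cleaned, "\\s+"))
}

text1 <- "Whosebug is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD. So Stack is immensely useful. Thank you guys to sort this out for me."
text2 <- "Whosebug is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD."

words1 <- normalize_words(text1)
words2 <- normalize_words(text2)

# setdiff() returns the unique elements of its first argument
# that do not occur in its second argument.
only_in_text1 <- setdiff(words1, words2)
only_in_text2 <- setdiff(words2, words1)

print(only_in_text1)
print(only_in_text2)
```

Because text2 is a prefix of text1, only_in_text2 comes back empty, while only_in_text1 holds the ten extra words.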

Answer 4

The question is how to compare the feature sets of two (quanteda) dfm objects, not how to reinvent a way of tokenizing text.

> setdiff(featnames(dfm1), featnames(dfm2))
 [1] "so"        "stack"     "immensely" "useful"    "thank"     "guys"     
 [7] "sort"      "this"      "out"       "for"

This returns the features that are in dfm1 but not in dfm2.

@JBGruber's answer also works, but in the upcoming v2 we are deprecating the use of dfm_select() where pattern is another dfm.
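
The question asks for the words in dfm2 that are missing from dfm1, which is the reverse of the direction shown above, so it may help to see both directions side by side. The following is a minimal sketch, assuming a recent quanteda release in which, as far as I know, punctuation is removed in tokens() rather than by passing a corpus straight to dfm(). The names only_in_dfm1 and only_in_dfm2 are illustrative.

```r
library(quanteda)

text1 <- "Whosebug is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD. So Stack is immensely useful. Thank you guys to sort this out for me."
text2 <- "Whosebug is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD."

# Tokenize first (dropping punctuation), then build each dfm from the tokens.
# dfm() lower-cases features by default.
dfm1 <- dfm(tokens(text1, remove_punct = TRUE))
dfm2 <- dfm(tokens(text2, remove_punct = TRUE))

# Features present in dfm1 but absent from dfm2
only_in_dfm1 <- setdiff(featnames(dfm1), featnames(dfm2))

# Features present in dfm2 but absent from dfm1 (the direction the question asks about)
only_in_dfm2 <- setdiff(featnames(dfm2), featnames(dfm1))

print(only_in_dfm1)
print(only_in_dfm2)
```

With the example texts, the second text is contained in the first, so only_in_dfm2 is empty and all the differing features show up in only_in_dfm1.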

r

quanteda