How can I find which words differ between two dfm objects?
I have two dfms and I want to know which words are missing or different between them.
For example:
library(quanteda)
df1 <- data.frame(Text = c("Whosebug is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD. So Stack is immensely useful. Thank you guys to sort this out for me."), stringsAsFactors = F)
corpus1 <- corpus(df1, text_field = "Text")
df2 <- data.frame(Text = c("Whosebug is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD."), stringsAsFactors = F)
corpus2 <- corpus(df2, text_field = "Text")
dfm1 <- dfm(corpus1, remove_punct = TRUE)
dfm2 <- dfm(corpus2, remove_punct = TRUE)
I would like to see which words in dfm2 do not appear in dfm1. Thanks a lot for your help!
This seems to do it:
corpus1 <- unlist(strsplit(df1$Text, "\\s"))
corpus2 <- unlist(strsplit(df2$Text, "\\s"))
Remove the punctuation:
corpus1 <- gsub("[.;!?,]", "", corpus1)
corpus2 <- gsub("[.;!?,]", "", corpus2)
Get the words that are in corpus1 but not in corpus2:
corpus1[!corpus1 %in% corpus2]
[1] "So" "Stack" "immensely" "useful" "Thank" "guys" "sort" "this"
[9] "out" "for"
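Note that `%in%` compares strings exactly, so "So" and "so" would count as different words, and repeated words are reported once per occurrence. A minimal base R sketch of a case-insensitive, de-duplicated variant follows; `words_only_in`, `text_a`, and `text_b` are hypothetical names chosen for illustration and are not part of the original answer.

```r
# Hypothetical helper: words that occur in `a` but not in `b`,
# compared case-insensitively after stripping punctuation.
words_only_in <- function(a, b) {
  clean <- function(x) {
    x <- tolower(gsub("[[:punct:]]", "", x))   # drop punctuation, lowercase
    words <- unlist(strsplit(x, "\\s+"))       # split on runs of whitespace
    unique(words[nzchar(words)])               # drop empty strings and repeats
  }
  setdiff(clean(a), clean(b))
}

# Small made-up inputs, not the data from the question.
text_a <- "Stack is immensely useful. Thank you!"
text_b <- "Thank you, it is useful."
words_only_in(text_a, text_b)
```

Swapping the two arguments gives the words that appear only in the second text.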
The answer above works well. However, I think it can be done more cleanly with dfm_select:
dfm_select(dfm1, pattern = dfm2, selection = "remove")
#> Document-feature matrix of: 1 document, 10 features (0.0% sparse).
#> 1 x 10 sparse Matrix of class "dfm"
#> features
#> docs so stack immensely useful thank guys sort this out for
#> text1 1 1 1 1 1 1 1 1 1 1
A base R one-liner:
unlist(strsplit(df1$Text, "\\s+"))[!(unlist(strsplit(gsub("[[:punct:]]",
"",
tolower(df1$Text)),
"\\s+")) %in%
(unlist(strsplit(gsub("[[:punct:]]", "", tolower(df2$Text)),
"\\s+"))))]
Data used:
df1 <- data.frame(Text = c("Whosebug is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD. So Stack is immensely useful. Thank you guys to sort this out for me."), stringsAsFactors = F)
df2 <- data.frame(Text = c("Whosebug is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD."), stringsAsFactors = F)
The question is how to compare the feature sets of two (quanteda) dfm objects, not how to reinvent a way of tokenizing text.
> setdiff(featnames(dfm1), featnames(dfm2))
[1] "so" "stack" "immensely" "useful" "thank" "guys"
[7] "sort" "this" "out" "for"
This gets the features that are in dfm1 but not in dfm2.
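Because `setdiff()` is not symmetric, the argument order decides which side's missing words are reported. The sketch below returns both directions plus the shared features. `compare_features` is a hypothetical helper, and `feats1` and `feats2` are made-up character vectors standing in for what `featnames(dfm1)` and `featnames(dfm2)` would return, so the example runs without quanteda.

```r
# Hypothetical helper; `x` and `y` are character vectors of feature names,
# e.g. featnames(dfm1) and featnames(dfm2).
compare_features <- function(x, y) {
  list(
    only_in_x = setdiff(x, y),   # features present in x but absent from y
    only_in_y = setdiff(y, x),   # features present in y but absent from x
    shared    = intersect(x, y)  # features present in both
  )
}

# Made-up feature names for illustration only.
feats1 <- c("help", "you", "so", "stack", "useful")
feats2 <- c("help", "you", "trust")

res <- compare_features(feats1, feats2)
res$only_in_x
res$only_in_y
res$shared
```

With real dfm objects, the call would presumably be `compare_features(featnames(dfm1), featnames(dfm2))`.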
@JBGruber's answer also works, but in the upcoming v2 we are deprecating the use of dfm_select() where pattern is another dfm.
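If a dfm is still wanted as the result rather than a character vector, one possible way to avoid passing a dfm as `pattern` is to pass its feature names instead. This is only a sketch under assumptions: it assumes a quanteda version in which `tokens()` and `dfm()` on a tokens object are available, and `dfm_a`, `dfm_b`, and the two short texts are made up for illustration.

```r
library(quanteda)

# Made-up texts; tokenize first, then build the dfms from the tokens.
toks_a <- tokens("So Stack is immensely useful. Thank you.", remove_punct = TRUE)
toks_b <- tokens("Stack is great. Thank you.", remove_punct = TRUE)
dfm_a <- dfm(toks_a)
dfm_b <- dfm(toks_b)

# Pass the feature names of dfm_b as a character pattern, not dfm_b itself.
# valuetype = "fixed" asks for literal matching rather than glob patterns.
only_a <- dfm_select(dfm_a, pattern = featnames(dfm_b),
                     selection = "remove", valuetype = "fixed")
featnames(only_a)
```

The remaining features should be the same set that `setdiff(featnames(dfm_a), featnames(dfm_b))` returns, with the counts from `dfm_a` kept alongside them.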