How to compute the proportion of words by day in quanteda?
Consider this simple example:
tibble(text = c('a grande latte with soy milk',
'black coffee no room',
'latte is a latte',
'coke, diet coke'),
myday = c(ymd('2018-01-01','2018-01-01','2018-01-03','2018-01-03'))) %>%
corpus() %>%
tokens() %>%
dfm()
Document-feature matrix of: 4 documents, 14 features (71.4% sparse).
4 x 14 sparse Matrix of class "dfm"
features
docs a grande latte with soy milk black coffee no room is coke , diet
text1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
text2 0 0 0 0 0 0 1 1 1 1 0 0 0 0
text3 1 0 2 0 0 0 0 0 0 0 1 0 0 0
text4 0 0 0 0 0 0 0 0 0 0 0 2 1 1
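I am interested in getting the proportion of the word coffee, aggregated by day.
That is, for the day 2018-01-01 we can see that there are 10 words (a, grande, latte, with, soy, milk, black, coffee, no, room) and coffee is mentioned only once. So the proportion is 1/10. The same reasoning applies to the other days.
How can I do this in quanteda? Of course, the idea is to avoid materializing the sparse matrix into a dense matrix.
Thanks!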
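This is simple, and it follows from a core quanteda design decision: your docvars are passed from the corpus object through to "downstream" objects such as the dfm. You can solve this by using dfm_group() on the myday docvar and then weighting.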
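As a quick check of that propagation, here is a minimal sketch that builds the same toy corpus from a base data.frame (an assumption made here only to avoid the tibble and lubridate dependencies) and confirms that myday is still attached to the dfm. The names df and dfmat_check are illustrative and not part of the original post.

```r
library("quanteda")

# Same four texts and dates as in the question, held in a base data.frame
df <- data.frame(
  text = c(
    "a grande latte with soy milk",
    "black coffee no room",
    "latte is a latte",
    "coke, diet coke"
  ),
  myday = as.Date(c("2018-01-01", "2018-01-01", "2018-01-03", "2018-01-03")),
  stringsAsFactors = FALSE
)

# corpus() takes the "text" column as the document text by default and keeps
# the remaining columns as document variables (docvars)
dfmat_check <- dfm(tokens(corpus(df)))

# The day of each document is still available on the dfm
docvars(dfmat_check)$myday
```

Because the date travels with each document, the grouping step can refer to it directly instead of requiring a separate lookup table.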
First, to make your example fully reproducible, and to assign a name to your dfm object:
library("quanteda")
## Package version: 1.4.3
library("tibble")
library("lubridate")
dfmat <- tibble(
text = c(
"a grande latte with soy milk",
"black coffee no room",
"latte is a latte",
"coke, diet coke"
),
myday = c(ymd("2018-01-01", "2018-01-01", "2018-01-03", "2018-01-03"))
) %>%
corpus() %>%
tokens() %>%
dfm()
Now it takes just two operations to get the result you want.
dfmat2 <- dfm_group(dfmat, groups = "myday") %>%
dfm_weight(scheme = "prop")
dfmat2
## Document-feature matrix of: 2 documents, 14 features (42.9% sparse).
## 2 x 14 sparse Matrix of class "dfm"
## features
## docs a grande latte with soy milk black coffee no room is
## 2018-01-01 0.100 0.1 0.10 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0
## 2018-01-03 0.125 0 0.25 0 0 0 0 0 0 0 0.125
## features
## docs coke , diet
## 2018-01-01 0 0 0
## 2018-01-03 0.25 0.125 0.125
dfmat2[, "coffee"]
## Document-feature matrix of: 2 documents, 1 feature (50.0% sparse).
## 2 x 1 sparse Matrix of class "dfm"
## features
## docs coffee
## 2018-01-01 0.1
## 2018-01-03 0
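If only the coffee column is needed as an ordinary named vector, the sketch below is one possible way to get it. It assumes a hand-built grouping factor called day (a hypothetical name) instead of the myday docvar: a grouping vector with one entry per document should be accepted by dfm_group() across quanteda versions, whereas the quoted docvar name used above is the form for version 1.4.3. To my knowledge, quanteda 3 and later expect the docvar unquoted (groups = myday), so treat that as something to verify against your installed version.

```r
library("quanteda")

# Toy texts from the question
txt <- c(
  text1 = "a grande latte with soy milk",
  text2 = "black coffee no room",
  text3 = "latte is a latte",
  text4 = "coke, diet coke"
)

# Hypothetical grouping factor: one entry per document, in document order
day <- factor(c("2018-01-01", "2018-01-01", "2018-01-03", "2018-01-03"))

dfmat <- dfm(tokens(txt))

# Sum counts within each day, then divide by each day's total token count
dfmat_day <- dfm_weight(dfm_group(dfmat, groups = day), scheme = "prop")

# Only this 2 x 1 slice is made dense; the full dfm stays sparse
coffee_by_day <- as.matrix(dfmat_day[, "coffee"])[, 1]
coffee_by_day
```

Note that the denominator is the number of tokens, and the comma in "coke, diet coke" is counted as a token, which is why the 2018-01-03 proportions above are multiples of 1/8. If punctuation should not count, tokens(txt, remove_punct = TRUE) drops it before the dfm is built.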