How to use quanteda on aggregated data?
Consider this example
tibble(text = c('a grande latte with soy milk',
'black coffee no room'),
repetition = c(100, 2))
# A tibble: 2 x 2
text repetition
<chr> <dbl>
1 a grande latte with soy milk 100
2 black coffee no room 2
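The data says that the sentence a grande latte with soy milk appears 100 times in my dataset. Of course, storing that redundancy is a waste of memory, which is why I have the repetition variable.
Still, I would like the dfm from quanteda to reflect this, because the sparseness of the dfm gives me some room to keep that information. That is, how can I still have 100 rows for the first text in the dfm? Just using the following code does not take repetition into account: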
tibble(text = c('a grande latte with soy milk',
'black coffee no room'),
repetition = c(100, 2)) %>%
corpus() %>%
tokens() %>%
dfm()
Document-feature matrix of: 2 documents, 10 features (50.0% sparse).
2 x 10 sparse Matrix of class "dfm"
features
docs a grande latte with soy milk black coffee no room
text1 1 1 1 1 1 1 0 0 0 0
text2 0 0 0 0 0 0 1 1 1 1
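Assuming your data.frame is called df1, you can use cbind to add a column to the dfm. But that might not give you the result you need. The other two options below are probably better.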
cbind
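library("quanteda")
library("dplyr")  # supplies tibble() and %>%
df1 <- tibble(text = c('a grande latte with soy milk',
                       'black coffee no room'),
              repetition = c(100, 2))
# build the plain dfm once; the docvars and multiplication options reuse it
my_dfm <- df1 %>%
  corpus() %>%
  tokens() %>%
  dfm()
# add a column named repetition to the dfm, without overwriting my_dfm
cbind(my_dfm, repetition = df1$repetition)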
Document-feature matrix of: 2 documents, 11 features (45.5% sparse).
2 x 11 sparse Matrix of class "dfm"
features
docs a grande latte with soy milk black coffee no room repetition
text1 1 1 1 1 1 1 0 0 0 0 100
text2 0 0 0 0 0 0 1 1 1 1 2
docvars
You can also add the data via the docvars function. The data is then attached to the dfm, but it is tucked away in a slot of the dfm class (reachable with @).
docvars(my_dfm, "repetition") <- df1$repetition
docvars(my_dfm)
repetition
text1 100
text2 2
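Multiplication
Multiply the dfm by the repetition vector, which scales each document's counts by its repetition: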
my_dfm * df1$repetition
Document-feature matrix of: 2 documents, 10 features (50.0% sparse).
2 x 10 sparse Matrix of class "dfm"
features
docs a grande latte with soy milk black coffee no room
text1 100 100 100 100 100 100 0 0 0 0
text2 0 0 0 0 0 0 2 2 2 2
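The multiplication works row by row because R recycles the vector down each column. A minimal base-R sketch of that behaviour, using an ordinary matrix as a stand-in for the dfm (the matrix m and its four feature names are made up for illustration and are not quanteda objects):

```r
# Plain matrix standing in for a small dfm (illustrative only)
m <- rbind(text1 = c(1, 1, 1, 0),
           text2 = c(0, 0, 0, 1))
colnames(m) <- c("grande", "latte", "milk", "black")
reps <- c(100, 2)

# Elementwise multiplication recycles reps down each column, so row i
# is scaled by reps[i] as long as length(reps) equals nrow(m)
weighted <- m * reps
weighted
```

This only lines up when the vector has exactly one entry per document, in the same order as the rows.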
You can use indexing to get the repetition you want, while keeping the efficiency of storing each text only once.
library("tibble")
library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
tib <- tibble(
text = c(
"a grande latte with soy milk",
"black coffee no room"
),
repetition = c(100, 2)
)
# tokenize explicitly; recent quanteda releases expect tokens as input to dfm()
dfmat <- corpus(tib) %>%
  tokens() %>%
  dfm()
Define a function that builds a row index repeated according to your "repetition" variable:
repindex <- function(x) rep(seq_along(x), times = x)
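To see what this helper produces, here is a small base-R sketch on a made-up count vector (the values 3, 1, 2 are purely illustrative): position i appears as many times as the i-th count.

```r
# Same helper as above, repeated so the snippet runs on its own
repindex <- function(x) rep(seq_along(x), times = x)

# Illustrative counts: first item 3 times, second once, third twice
idx <- repindex(c(3, 1, 2))
idx
# [1] 1 1 1 2 3 3

# With the counts from the question, the index has 100 + 2 entries
length(repindex(c(100, 2)))
# [1] 102
```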
Then index the two-document dfm with that repeated index:
dfmat2 <- dfmat[repindex(tib$repetition), ]
dfmat2
## Document-feature matrix of: 102 documents, 10 features (40.4% sparse).
head(dfmat2, 2)
## Document-feature matrix of: 2 documents, 10 features (40.0% sparse).
## 2 x 10 sparse Matrix of class "dfm"
## features
## docs a grande latte with soy milk black coffee no room
## text1 1 1 1 1 1 1 0 0 0 0
## text1 1 1 1 1 1 1 0 0 0 0
tail(dfmat2, 4)
## Document-feature matrix of: 4 documents, 10 features (50.0% sparse).
## 4 x 10 sparse Matrix of class "dfm"
## features
## docs a grande latte with soy milk black coffee no room
## text1 1 1 1 1 1 1 0 0 0 0
## text1 1 1 1 1 1 1 0 0 0 0
## text2 0 0 0 0 0 0 1 1 1 1
## text2 0 0 0 0 0 0 1 1 1 1
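As a rough sanity check on how the row-expansion and multiplication approaches relate, the sketch below uses a plain base-R matrix in place of the dfm (the matrix m and its feature names are hypothetical). It suggests that the per-feature totals agree either way, so the expanded form is only needed when you actually want one row per occurrence.

```r
# Plain matrix standing in for the two-document dfm (illustrative only)
m <- rbind(text1 = c(1, 1, 0),
           text2 = c(0, 0, 1))
colnames(m) <- c("latte", "milk", "coffee")
reps <- c(100, 2)

# Approach 1: repeat each row reps[i] times via an index
expanded <- m[rep(seq_along(reps), times = reps), , drop = FALSE]

# Approach 2: scale each row by its repetition
weighted <- m * reps

# Both give the same total count per feature
totals_expanded <- colSums(expanded)
totals_weighted <- colSums(weighted)
all.equal(totals_expanded, totals_weighted)
```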