R: term frequency from a large document set
I have a data frame like this:

ID  content
1   hello you how are you
1   you are ok
2   test

I need the frequency of each word in the space-separated content column, per ID. Essentially, that means finding the unique terms in the column and tabulating their frequencies grouped by ID, displayed like this:

ID  hello  you  how  are  ok  test
1       1    3    1    2   1     0
2       0    0    0    0   0     1
This is what I tried:

test <- unique(unlist(strsplit(temp$content, split = " ")))
df <- cbind(temp, sapply(test, function(y) apply(temp, 1, function(x) as.integer(y %in% unlist(strsplit(x, split = " "))))))

This gives an ungrouped indicator matrix that I am now trying to aggregate by ID, but I have more than 20,000 unique terms in content. Is there an efficient way to do this?
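For reference, a minimal base-R sketch of the intended count-then-group step (it assumes the data frame is called temp with a content column as above, and it does not address the efficiency concern for 20,000+ terms):

words  <- strsplit(temp$content, split = " ")
# count every term in every row, keeping a fixed column order given by `test`
counts <- t(sapply(words, function(w) table(factor(w, levels = test))))
# sum the per-row counts within each ID
rowsum(counts, group = temp$ID)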
You can use data.table:
library(data.table)
setDT(df1)[, unlist(strsplit(content, split = " ")), by = ID
][, dcast(.SD, ID ~ V1)]
#    ID are hello how ok test you
#1:   1   2     1   1  1    0   3
#2:   2   0     0   0  0    1   0
In the first part, we apply unlist(strsplit(content, split = " ")) grouped by ID, which produces the following output:
# ID V1
#1: 1 hello
#2: 1 you
#3: 1 how
#4: 1 are
#5: 1 you
#6: 1 you
#7: 1 are
#8: 1 ok
#9: 2 test
In the next step, we use dcast to reshape the data into wide format.
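To spell out the reshaping step (a sketch, not part of the original chain): with duplicate ID/V1 combinations, dcast falls back to counting, which is equivalent to passing the aggregation function explicitly:

library(data.table)
# long format produced by the first step above
long <- setDT(df1)[, unlist(strsplit(content, split = " ")), by = ID]
# count occurrences of each term (V1) within each ID; absent combinations become 0
dcast(long, ID ~ V1, fun.aggregate = length, value.var = "V1", fill = 0)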
Data
df1 <- structure(list(ID = c(1L, 1L, 2L), content = c("hello you how are you",
"you are ok", "test")), .Names = c("ID", "content"), class = "data.frame", row.names = c(NA,
-3L))
What about a package made for text mining?
# your data
text <- read.table(text = "
ID content
1 'hello you how are you'
1 'you are ok'
2 'test'", header = T, stringsAsFactors = FALSE) # remember the stringAsFactors life saver!
library(dplyr)
library(tidytext)
# unnest the content into one row per word
unnested <- text %>%
unnest_tokens(word, content)
# a classic data.frame from a table of frequencies
as.data.frame.matrix(table(unnested$ID, unnested$word))
  are hello how ok test you
1   2     1   1  1    0   3
2   0     0   0  0    1   0
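Staying in the tidyverse, the same cross-tabulation could also be written with count() and tidyr::pivot_wider() (a sketch along the same lines, not part of the original answer):

library(tidyr)
unnested %>%
  count(ID, word) %>%                  # term frequency per ID
  pivot_wider(names_from = word,
              values_from = n,
              values_fill = 0)         # IDs missing a term get 0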