Text Mining - Count Frequencies of Phrases (more than one word)
I am familiar with using the tm library to create a tdm and count the frequencies of terms.
But these terms are all single words.
How can I count the number of times a multi-word phrase occurs in a document and/or the corpus?
EDIT:
I am adding the code I currently have to improve/clarify my post.
Here is fairly standard code for building a term-document matrix:
library(tm)

cname <- "C:/Users/George/Google Drive/R Templates/Gospels corpus"
corpus <- Corpus(DirSource(cname))

# Cleaning (tolower is wrapped in content_transformer so the documents stay valid tm documents)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c("a", "the", "an", "that", "and"))

# Make sure each document is a PlainTextDocument
corpus <- tm_map(corpus, PlainTextDocument)

# Create a term-document matrix
tdm1 <- TermDocumentMatrix(corpus)
m1 <- as.matrix(tdm1)
word.freq <- sort(rowSums(m1), decreasing = TRUE)
word.freq <- word.freq[1:100]
The problem is that this returns a matrix of single-word terms, for example:
all into have from were one came say out
397 390 385 383 350 348 345 332 321
I want to be able to search the corpus for multi-word terms, so, for example, "came from" rather than "came" and "from" separately.
Thanks.
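One way to get multi-word terms directly out of tm is to supply a custom tokenizer through the control argument of TermDocumentMatrix. The following is only a minimal sketch: it assumes the corpus built above and uses the NLP package (which tm depends on) to generate bigrams, and the names tdm2, m2 and phrase.freq are just illustrative.

library(tm)
library(NLP)

# Tokenizer that returns word pairs instead of single words;
# change the 2 to 3 for three-word phrases
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
m2 <- as.matrix(tdm2)
phrase.freq <- sort(rowSums(m2), decreasing = TRUE)
head(phrase.freq)  # rows are now two-word terms, e.g. phrases like "came from"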
Given the text:
text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."
To find the frequency of each word:
table(strsplit(text, ' '))
- (and and count example frequency I is little my
3    1   2     2       2         2 2  3      2  3
of of). patter. pattern R some text the This to
 2    1       1       1 2    2    2   2    2  2
want
   2
For the frequency of a pattern:
length(gregexpr('\\bis\\b', text)[[1]])
[1] 3
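The same idea extends to phrases of more than one word. Below is a small sketch of a helper built on gregexpr; the name countPhrase is made up for illustration and is not part of the original code.

# Count non-overlapping occurrences of a literal (fixed) phrase in a string
countPhrase <- function(phrase, x) {
  hits <- gregexpr(phrase, x, fixed = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

countPhrase("is my", text)     # 2
countPhrase("This is", text)   # 2
countPhrase("came from", text) # 0 (phrase not present in this example text)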
I created the following function to get word n-grams and their corresponding frequencies:
library(tau)
library(data.table)
# given a string vector and size of ngrams this function returns word ngrams with corresponding frequencies
createNgram <- function(stringVector, ngramSize) {
  ngram <- data.table()
  ng <- textcnt(stringVector, method = "string", n = ngramSize, tolower = FALSE)
  if (ngramSize == 1) {
    ngram <- data.table(w1 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  } else {
    ngram <- data.table(w1w2 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  return(ngram)
}
Given a string like:
text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."
Here is how to call the function for pairs of words; for phrases of length 3, pass 3 as the argument instead:
res <- createNgram(text, 2)
Print res to see the output:
             w1w2 freq length
 1:        I want    2      6
 2:        R text    2      6
 3:       This is    2      7
 4:         and I    2      5
 5:        and is    1      6
 6:     count the    2      9
 7:   example and    2     11
 8:  frequency of    2     12
 9:         is my    3      5
10:      little R    2      8
11:     my little    2      9
12:         my of    1      5
13:       of This    1      7
14:       of some    2      7
15:   pattern and    1     11
16:   some patter    1     11
17:  some pattern    1     12
18:  text example    2     12
19: the frequency    2     13
20:      to count    2      8
21:       want to    2      7
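Since the result is an ordinary data.table, a single phrase can be looked up with standard data.table subsetting; the values in the comment below are taken from the output above. Passing 3 instead of 2 gives trigrams (note that the function keeps the column name w1w2 for any ngramSize greater than 1).

# Look up one phrase in the bigram table
res[w1w2 == "is my"]
#     w1w2 freq length
# 1: is my    3      5

# Word trigrams from the same text
res3 <- createNgram(text, 3)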
Here is a code example using tidytext: https://www.kaggle.com/therohk/news-headline-bigrams-frequency-vs-tf-idf
The same technique can be extended to larger values of n.
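The snippet below assumes a bigrams table already exists. As a rough sketch of how such a table could be built with tidytext's unnest_tokens (the headlines data frame and its year and text columns are placeholders for illustration, not taken from the linked notebook):

library(dplyr)
library(tidytext)
library(ggplot2)

# Placeholder input: one row per headline, with a year column and a text column
bigrams <- headlines %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)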
bigram_tf_idf <- bigrams %>%
  count(year, bigram) %>%
  filter(n > 2) %>%
  bind_tf_idf(bigram, year, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf.plot <- bigram_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  filter(tf_idf > 0) %>%
  mutate(bigram = factor(bigram, levels = rev(unique(bigram))))

bigram_tf_idf.plot %>%
  group_by(year) %>%
  top_n(10) %>%
  ungroup %>%
  ggplot(aes(bigram, tf_idf, fill = year)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~year, ncol = 3, scales = "free") +
  theme(text = element_text(size = 10)) +
  coord_flip()