How to count frequency of pair words keeping the order sequence in text using R?
I'm looking for a way to get the frequency of word pairs occurring in a series of events separated by a single character. Example:
Input:
"Start>Press1>Press2>PressQR>Exit"
"Start>PressA>Press2>PressQR>QuitL>Exit"
"Start>Press1>Press2>Press3>Exit"
Output:
Start>Press1 2
Press1>Press2 2
Press2>PressQR 2
PressQR>Exit 1
Start>PressA 1
PressA>Press2 1
PressQR>QuitL 1
QuitL>Exit 1
Press2>Press3 1
Press3>Exit 1
Thanks.
input <- c("Start>Press1>Press2>PressQR>Exit",
           "Start>PressA>Press2>PressQR>QuitL>Exit",
           "Start>Press1>Press2>Press3>Exit")

# Split one string on ">" and paste consecutive elements back together
gen_pairs <- function(x) {
  x_split <- unlist(strsplit(x, ">"))
  paste(x_split[-length(x_split)], x_split[-1], sep = ">")
}

all_pairs <- unlist(lapply(input, gen_pairs))
all_pairs_ctab <- table(all_pairs)

# table() sorts names alphabetically; reorder to first appearance in the input
as.data.frame(all_pairs_ctab[match(unique(all_pairs), names(all_pairs_ctab))])
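For reference, the counts this produces, listed in first-appearance order (a sketch of the expected result; the exact column alignment may differ, and the `all_pairs`/`Freq` column names come from `table()` and `as.data.frame()`):

```r
#         all_pairs Freq
# 1    Start>Press1    2
# 2   Press1>Press2    2
# 3  Press2>PressQR    2
# 4    PressQR>Exit    1
# 5    Start>PressA    1
# 6   PressA>Press2    1
# 7   PressQR>QuitL    1
# 8      QuitL>Exit    1
# 9   Press2>Press3    1
# 10    Press3>Exit    1
```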
You can use the tidytext package, which supports ngram tokenization via its unnest_tokens function:
library(dplyr)
library(tidytext)
data.frame(text = c("Start>Press1>Press2>PressQR>Exit", "Start>PressA>Press2>PressQR>QuitL>Exit", "Start>Press1>Press2>Press3>Exit")) %>%
unnest_tokens(bigram, text, 'ngrams', n = 2, to_lower = FALSE) %>%
count(bigram)
#> # A tibble: 11 × 2
#> bigram n
#> <chr> <int>
#> 1 Exit Start 2
#> 2 Press1 Press2 2
#> 3 Press2 Press3 1
#> 4 Press2 PressQR 2
#> 5 Press3 Exit 1
#> 6 PressA Press2 1
#> 7 PressQR Exit 1
#> 8 PressQR QuitL 1
#> 9 QuitL Exit 1
#> 10 Start Press1 2
#> 11 Start PressA 1
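Note the `Exit Start` rows: by default unnest_tokens collapses the text rows into one string before ngram tokenization, so bigrams spanning row boundaries show up (two boundaries between the three rows, hence n = 2). If those are unwanted, a sketch of how to keep rows separate is below; the argument is `collapse`, and whether `FALSE` or `NULL` is the accepted value depends on your tidytext version:

```r
library(dplyr)
library(tidytext)

data.frame(text = c("Start>Press1>Press2>PressQR>Exit",
                    "Start>PressA>Press2>PressQR>QuitL>Exit",
                    "Start>Press1>Press2>Press3>Exit")) %>%
  # collapse = FALSE keeps each row as its own document,
  # so no bigram crosses a row boundary
  unnest_tokens(bigram, text, 'ngrams', n = 2, to_lower = FALSE,
                collapse = FALSE) %>%
  count(bigram)
```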
Or, if you prefer, you can do the same with the underlying tokenizers::tokenize_ngrams function and table().
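A minimal sketch of that approach (assuming tokenize_ngrams accepts `n` and `lowercase` arguments, as in current tokenizers releases):

```r
library(tokenizers)

x <- c("Start>Press1>Press2>PressQR>Exit",
       "Start>PressA>Press2>PressQR>QuitL>Exit",
       "Start>Press1>Press2>Press3>Exit")

# tokenize_ngrams treats ">" as a word separator, so each string yields
# its within-string bigrams, joined by a space
bigrams <- unlist(tokenize_ngrams(x, n = 2, lowercase = FALSE))
table(bigrams)
```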
Build a directed edge list, then aggregate:
x <- c("Start>Press1>Press2>PressQR>Exit",
       "Start>PressA>Press2>PressQR>QuitL>Exit",
       "Start>Press1>Press2>Press3>Exit")

# Two-column matrix of (from, to) edges from each split sequence
edgelist <- do.call(rbind, lapply(strsplit(x, ">"),
                                  function(s) cbind(head(s, -1), s[-1])))

# Attach a count column of 1s and sum within each (X1, X2) pair
aggregate(count ~ ., data.frame(edgelist, count = 1), FUN = sum)
# X1 X2 count
#1 Press3 Exit 1
#2 PressQR Exit 1
#3 QuitL Exit 1
#4 Start Press1 2
#5 Press1 Press2 2
#6 PressA Press2 1
#7 Press2 Press3 1
#8 Start PressA 1
#9 Press2 PressQR 2
#10 PressQR QuitL 1