How to detect pattern and frequency in a column of characters, using R?

I have a df showing the "activity chain" of people, which looks like this (a snippet of the data is at the bottom of the question):

head(agents)
   id                                                                                                                                                                leg_activity
1   9                                                                                      home, adpt, shop, car_passenger, home, adpt, work, adpt, home, work, outside, pt, home
2  10 home, pt, outside, pt, home, car, leisure, car, other, car, leisure, car, leisure, car, other, car, leisure, car, other, car, leisure, car, home, adpt, leisure, adpt, home
3  11                                                                                                                                                      home, work, adpt, home
4  96                                                                                                                                home, car, work, car, home, work, adpt, home
5  97                              home, adpt, work, car_passenger, leisure, car_passenger, work, adpt, home, car_passenger, outside, car_passenger, outside, car_passenger, home
6 101                                       home, bike, outside, car_passenger, outside, car_passenger, outside, bike, home, adpt, leisure, adpt, home, bike, leisure, bike, home

I am interested in detecting the patterns in which adpt occurs. The simplest approach is the count() function, which gives me a frequency table as output. Unfortunately, that result is misleading.
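For reference, here is a minimal sketch of a call that produces this kind of table (assuming plyr::count(), which names its output columns x and freq):

library(plyr)

# Each complete activity chain is counted as one opaque string, which is why
# patterns hidden inside longer chains are not picked up.
arrange(count(agents$leg_activity), desc(freq))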

This is what it looks like:

x                                 freq
home, adpt, work, adpt, home      2071
home, adpt, shop, adpt, home      653
home, adpt, education, adpt, home 545
home, pt, work, adpt, home        492
home, adpt, work, pt, home        468
home, adpt, work, home            283

The problem with this approach is that I cannot detect a pattern inside longer activity chains; for example:

 home, adpt, education, adpt, education, adpt, home, car, work, car, home, shop, adpt, home

This case starts with an activity chain that on its own is very frequent, but because more activities follow it, it no longer shows up in the count() results.

Is there a way to use a counting function while also taking into account what happens inside a cell? So it would be interesting to have a table that shows all possible combinations and their frequencies, like the one below (a rough sketch of what I mean follows the table):

x                                freq
home, adpt, home                 10
home, adpt, home, pt, work, home 4
home, pt, work, home             2
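To make the goal more concrete, here is a rough sketch of the kind of counting I have in mind, using only base R (I am hoping there is a cleaner or more general way to do this):

# Sketch: cut each chain into home-to-home tours and count how often each tour occurs.
home_tours <- function(chain) {
  acts <- strsplit(chain, ",\\s*")[[1]]   # split the chain into single activities
  idx  <- which(acts == "home")           # positions of the "home" stops
  if (length(idx) < 2) return(character(0))
  mapply(function(i, j) paste(acts[i:j], collapse = ", "),
         idx[-length(idx)], idx[-1])      # one tour per pair of consecutive homes
}

tours <- unlist(lapply(agents$leg_activity, home_tours))
sort(table(tours), decreasing = TRUE)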

Thank you very much for your help!

Data:

structure(list(id = c(9L, 10L, 11L, 96L, 97L, 101L, 103L, 248L, 
499L, 1044L, 1215L, 1238L, 1458L, 1569L, 1615L, 1626L, 1734L, 
1735L, 1790L, 1912L, 9040L, 14858L, 14859L, 14967L, 15011L, 15012L, 
15015L, 15045L, 15050L, 15058L, 15060L, 15086L, 15088L, 15094L, 
15109L, 15113L, 15152L, 15157L, 15192L, 15193L, 15222L, 15230L, 
15231L, 15234L, 15235L, 15237L, 15256L, 15257L, 15258L, 15269L, 
15270L, 15318L, 15319L, 15338L, 15369L, 15371L, 15396L, 15397L, 
15399L, 15404L, 15505L, 15506L, 15515L, 15516L, 15525L, 15542L, 
15593L, 15602L, 15608L, 15643L, 15667L, 15727L, 15728L, 15729L, 
15752L, 15775L, 15808L, 15851L, 15869L, 15881L, 15882L, 15960L, 
15962L, 15966L, 16058L, 16107L, 16174L, 16229L, 16237L, 16238L, 
16291L, 16333L, 16416L, 16418L, 16449L, 16450L, 16451L, 16491L, 
16506L, 16508L), leg_activity = c("home, adpt, shop, car_passenger, home, adpt, work, adpt, home, work, outside, pt, home", 
"home, pt, outside, pt, home, car, leisure, car, other, car, leisure, car, leisure, car, other, car, leisure, car, other, car, leisure, car, home, adpt, leisure, adpt, home", 
"home, work, adpt, home", "home, car, work, car, home, work, adpt, home", 
"home, adpt, work, car_passenger, leisure, car_passenger, work, adpt, home, car_passenger, outside, car_passenger, outside, car_passenger, home", 
"home, bike, outside, car_passenger, outside, car_passenger, outside, bike, home, adpt, leisure, adpt, home, bike, leisure, bike, home", 
"home, adpt, work, adpt, home, walk, other, pt, home", "home, adpt, work, walk, home, adpt, work, walk, home", 
"home, adpt, leisure, adpt, home, bike, outside, bike, home", 
"home, pt, work, adpt, home, adpt, work, adpt, home", "home, adpt, work, adpt, home, car, outside, car, work, car, work, car, home", 
"home, work, leisure, adpt, home", "home, outside, pt, home, adpt, leisure, adpt, home", 
"home, car_passenger, leisure, walk, work, walk, leisure, walk, work, adpt, home, walk, home", 
"home, adpt, work, walk, work, walk, work, pt, home", "home, car, work, pt, leisure, adpt, work, car, home, car, home", 
"home, adpt, other, adpt, home, car, home", "home, adpt, other, adpt, home", 
"home, education, walk, shop, walk, education, pt, outside, home, adpt, leisure, adpt, home", 
"home, adpt, work, adpt, home, walk, home", "home, adpt, work, pt, leisure, adpt, work, adpt, work, adpt, home, adpt, other, walk, home", 
"home, adpt, work, adpt, home, adpt, work, adpt, home, walk, leisure, walk, home", 
"home, adpt, work, adpt, home, work, adpt, home, walk, leisure, walk, home", 
"home, adpt, work, adpt, home, car_passenger, outside, car_passenger, leisure, car_passenger, home, car_passenger, home", 
"home, adpt, other, adpt, home, car, work, car, home", "home, adpt, education, adpt, leisure, adpt, home, walk, leisure, walk, home", 
"home, car_passenger, other, pt, home, walk, other, walk, home, car_passenger, other, walk, home, adpt, other, adpt, home", 
"home, work, pt, work, adpt, work, adpt, home", "home, adpt, leisure, adpt, home, car, shop, car, other, car, home", 
"home, adpt, work, adpt, home, walk, other, adpt, home", "home, adpt, work, adpt, home, car_passenger, leisure, car_passenger, home", 
"home, car, other, car, home, adpt, shop, adpt, home", "home, pt, work, adpt, home", 
"home, adpt, work, adpt, home", "home, adpt, work, adpt, home", 
"home, walk, education, adpt, home, walk, education, walk, home, bike, leisure, bike, home", 
"home, adpt, shop, adpt, home, car, home", "home, adpt, leisure, walk, leisure, walk, leisure, adpt, home", 
"home, adpt, shop, pt, home, adpt, other, adpt, home", "home, adpt, other, adpt, home, car_passenger, leisure, walk, home", 
"home, adpt, work, adpt, home, car_passenger, shop, car_passenger, home", 
"home, adpt, other, adpt, work, adpt, home", "home, adpt, work, adpt, home, adpt, other, walk, shop, walk, home, car, outside, car, outside, car, outside, car, home", 
"home, adpt, other, adpt, home", "home, adpt, education, adpt, home, adpt, education, adpt, home", 
"home, pt, work, adpt, work, adpt, work, adpt, work, adpt, home, adpt, work, adpt, home", 
"home, walk, other, car_passenger, education, walk, home, car_passenger, education, adpt, home", 
"home, walk, shop, walk, home, walk, leisure, adpt, leisure, adpt, home", 
"home, adpt, work, adpt, home, walk, shop, walk, home, walk, leisure, walk, home, walk, home", 
"home, adpt, leisure, adpt, home", "home, walk, leisure, walk, home, adpt, other, adpt, shop, walk, leisure, walk, home", 
"home, pt, leisure, adpt, home, pt, outside, pt, home, bike, leisure, bike, home", 
"home, pt, outside, pt, home, walk, home, walk, other, adpt, shop, pt, home, car_passenger, leisure, adpt, home", 
"home, adpt, work, adpt, home, adpt, shop, adpt, work, adpt, home", 
"home, adpt, shop, adpt, other, walk, home", "home, walk, other, walk, home, walk, home, adpt, other, adpt, home, adpt, shop, adpt, home, car, other, car, home, adpt, other, adpt, home", 
"home, adpt, leisure, pt, home", "home, leisure, adpt, home", 
"home, adpt, leisure, pt, shop, walk, home, walk, shop, walk, home", 
"home, car, outside, car, outside, leisure, car, outside, car, outside, car, home, adpt, other, adpt, home", 
"home, adpt, work, adpt, shop, walk, home", "home, adpt, other, walk, work, adpt, home, adpt, other, adpt, work, adpt, home, adpt, leisure, adpt, home", 
"home, adpt, leisure, adpt, home, car, shop, car, home", "home, walk, shop, adpt, home, car, other, car, home, adpt, other, adpt, home", 
"home, walk, leisure, walk, home, adpt, work, adpt, home", "home, adpt, work, adpt, home", 
"home, adpt, leisure, pt, shop, adpt, home, adpt, leisure, walk, home", 
"home, walk, other, walk, leisure, walk, home, car, leisure, car, home, walk, leisure, adpt, home", 
"home, adpt, work, adpt, home", "home, walk, leisure, walk, home, adpt, leisure, adpt, home, adpt, leisure, walk, home", 
"home, walk, home, walk, shop, walk, home, walk, leisure, walk, home, adpt, other, adpt, home", 
"home, car_passenger, outside, car_passenger, outside, car_passenger, home, adpt, other, adpt, home", 
"home, walk, education, adpt, home", "home, adpt, education, walk, home, bike, education, bike, home", 
"home, adpt, other, adpt, home, adpt, shop, pt, home", "home, adpt, other, adpt, shop, walk, home, adpt, leisure, car_passenger, home", 
"home, adpt, work, adpt, other, adpt, home", "home, adpt, work, adpt, home", 
"home, adpt, work, adpt, home, walk, home", "home, car, work, adpt, leisure, adpt, work, car, home", 
"home, adpt, shop, adpt, home, car, other, car, home, car_passenger, outside, car_passenger, home", 
"home, adpt, work, pt, home, car, shop, car, home", "home, walk, other, adpt, work, adpt, shop, adpt, shop, adpt, home", 
"home, adpt, leisure, adpt, shop, adpt, leisure, pt, home", "home, adpt, leisure, adpt, shop, adpt, home", 
"home, car, outside, car, outside, car, outside, car, outside, car, home, adpt, education, pt, home", 
"home, adpt, work, adpt, home", "home, adpt, shop, adpt, home", 
"home, adpt, education, adpt, home, adpt, education, adpt, home", 
"home, adpt, other, adpt, other, walk, leisure, adpt, other, adpt, home", 
"home, adpt, work, adpt, home", "home, adpt, work, adpt, home, car, other, car, home", 
"home, car, work, car, shop, car, home, adpt, work, adpt, home, car, home", 
"home, walk, other, walk, education, adpt, home, adpt, education, walk, home, walk, home", 
"home, adpt, shop, walk, leisure, adpt, home", "home, adpt, shop, walk, home, adpt, work, adpt, home", 
"home, adpt, leisure, adpt, shop, walk, home", "home, walk, other, adpt, shop, walk, home, walk, other, walk, home, walk, other, walk, other, adpt, home", 
"home, adpt, education, walk, home, walk, education, walk, home, walk, home", 
"home, bike, education, bike, home, adpt, education, adpt, home, walk, home"
)), row.names = c(NA, 100L), class = "data.frame")

I am not quite sure what exactly you want to do, but I understand that you are interested in detecting the patterns in which the activity adpt occurs. This is typically done in NLP, and below is a solution using the tidytext package. I split the leg_activity column into so-called n-grams, i.e. into sequences of consecutive words. A sequence of two consecutive words is called a bi-gram, three consecutive words a tri-gram, and so on. When we count these n-grams, we learn which activities most often come directly before adpt and which most often come directly after it.

This is how to do it for bi-grams:

library(dplyr)
library(stringr)
library(tidytext)

df %>% 
  unnest_tokens(bigram, leg_activity, token = "ngrams", n = 2) %>%  # split each chain into pairs of consecutive activities
  filter(str_detect(bigram, "adpt")) %>%                            # keep only pairs involving adpt
  count(bigram, sort = TRUE)

           bigram   n
1       home adpt 100
2       adpt home  97
3       work adpt  51
4       adpt work  48
5    leisure adpt  27
6      adpt other  26
7      other adpt  26
8    adpt leisure  24
9       adpt shop  22
10      shop adpt  13
11 adpt education  10
12 education adpt  10

So adpt is most often preceded by "home", and "home" is also the activity that most often follows directly after adpt. If we are interested in three consecutive activities that include "adpt", we can do the same with tri-grams:

df %>% 
  unnest_tokens(bigram, leg_activity, token = "ngrams", n = 3) %>%  #n is the only thing that changed
  filter(str_detect(bigram, "adpt")) %>% 
  count(bigram, sort = TRUE)

                    bigram  n
1                work adpt home 42
2                adpt work adpt 40
3                home adpt work 36
4               home adpt other 22
5               adpt other adpt 21
6             home adpt leisure 20
7             leisure adpt home 19
8               other adpt home 18
9             adpt leisure adpt 16
10               adpt home adpt 15
11               home adpt shop 12
12                adpt home car 11
13               adpt home walk 11
14               adpt shop adpt 11
15          home adpt education 10
16          education adpt home  9
[list continues]

This list is much longer because there are now many more possible combinations. Here is a link to an excellent tutorial on n-grams in case you want to learn more. Is this what you wanted to do?
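As a small extension (just a sketch, building on the same unnest_tokens() call): if you want all adpt-containing n-grams of several lengths in a single table, you could loop over n with purrr::map_dfr(); the range 2:5 below is an arbitrary choice.

library(dplyr)
library(stringr)
library(tidytext)
library(purrr)

# Collect adpt-containing n-grams of lengths 2 to 5 in one frequency table.
map_dfr(2:5, function(len) {
  df %>% 
    unnest_tokens(ngram, leg_activity, token = "ngrams", n = len) %>% 
    filter(str_detect(ngram, "adpt")) %>% 
    count(ngram, sort = TRUE) %>% 
    mutate(ngram_length = len)   # keep track of which n produced the row
}) %>% 
  arrange(desc(n))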