Calculate frequency of all possible combinations of substrings in a vector

Question

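I have a vector like strings below, and I want to calculate the frequency of each value delimited by "|", as well as the frequency of their combinations, as in result below (in R).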

strings <- c('a', 'a|b', 'a|c', 'a|b|c|d')

# Calculate how many times 'a' is present, how many times 'a' and 'b', denoted 'ab', are present, etc. My goal is to be able to identify which combinations of substrings are most common.

result <- data.frame(substring = c('a', 'b', 'c', 'd', 'ab', 'ac', 'ad', 'bc', 'bd', 'abc', 'abd', 'abcd'),
                     frequency = c(1, .5, .5, .25, .5, .5, .25, .25, .25, .25, .25, .25))

Answer 1

First, the set of all subsets of a given set is called its power set.

The rje package contains a function, powerSet, that generates the power set of a given vector.
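To make the idea concrete, here is a minimal base-R sketch of a power set builder. The name power_set is a hypothetical helper introduced only for illustration; it is not the rje implementation, and it may return the subsets in a different order than rje's powerSet.

```r
# Hypothetical helper (an assumption for illustration, not rje::powerSet):
# build every subset of x by collecting the combinations of each size.
power_set <- function(x) {
  subsets <- list(x[0])  # start with the empty set
  for (k in seq_along(x)) {
    # combn(..., simplify = FALSE) returns a list of all size-k subsets
    subsets <- c(subsets, combn(x, k, simplify = FALSE))
  }
  subsets
}

ps <- power_set(c("a", "b", "c"))
length(ps)  # a set of 3 elements has 2^3 subsets
```

Because the number of subsets doubles with every additional element, this kind of enumeration is only practical for a small number of distinct values.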

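Below is a function that takes a vector as input (assuming it has the form indicated by the OP, i.e. pipe-delimited lowercase letters), generates the power set of the letters it contains, and finally determines the frequency of each substring.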

library(rje)

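getFreqs <- function(v) {
    # Collect the distinct letters occurring anywhere in v
    # ("|" matches nothing in letters, maps to 0, and is dropped when indexing)
    idx_lets <- Reduce(union, lapply(strsplit(v, ""), function(x) {
        match(x, letters, nomatch = 0)
    }))

    # For every non-empty subset of those letters, count the elements of v
    # that contain all letters of the subset
    do.call(rbind, lapply(powerSet(letters[idx_lets])[-1], function(x) {
        # lapply always returns a list, so Reduce(intersect, ...) also works
        # when the index vectors happen to have equal lengths
        hits <- Reduce(intersect, lapply(x, function(y) {
            which(grepl(y, v, fixed = TRUE))
        }))
        data.frame(substring = paste(x, collapse = ""),
                   frequency = length(hits) / length(v))
    }))
}

Here is an example:

getFreqs(strings)
   substring frequency
1          a      1.00
2          b      0.50
3         ab      0.50
4          c      0.50
5         ac      0.50
6         bc      0.25
7        abc      0.25
8          d      0.25
9         ad      0.25
10        bd      0.25
11       abd      0.25
12        cd      0.25
13       acd      0.25
14       bcd      0.25
15      abcd      0.25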
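The same steps (split the input, enumerate combinations, count) can also be sketched in base R without the rje package and without pattern matching. This is only a sketch under stated assumptions: the variable names tokens, combos_per_string, counts and freqs are introduced here for illustration, and the combination labels are built by pasting sorted tokens together with no separator, which is unambiguous only for single-character values like the OP's.

```r
strings <- c("a", "a|b", "a|c", "a|b|c|d")

# Split each element on the literal "|" and keep its distinct tokens
tokens <- lapply(strsplit(strings, "|", fixed = TRUE), unique)

# For each element, list every non-empty combination of its sorted tokens
combos_per_string <- lapply(tokens, function(tk) {
  tk <- sort(tk)
  unlist(lapply(seq_along(tk), function(k) {
    combn(tk, k, FUN = paste, collapse = "")
  }))
})

# Each combination appears at most once per element, so tabulating the
# pooled labels counts how many elements contain each combination
counts <- table(unlist(combos_per_string))
freqs <- data.frame(substring = names(counts),
                    frequency = as.vector(counts) / length(strings))

# Order by combination size, then alphabetically
freqs[order(nchar(freqs$substring), freqs$substring), ]
```

Unlike the power-set approach, this only enumerates combinations that actually occur in at least one element, so combinations with frequency zero do not appear in the result. For values longer than one character, a non-empty collapse string would be needed to keep the labels unambiguous.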
