How do I count the frequency of a character within a string, by a group?

Question

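My data.frame contains information on movements completed by individuals, along with strings (made up of alphabetic characters) that represent those movements in a database. It is structured as follows: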

MovementAnalysis <- structure(list(Strings = c("AaB", "cZhH", "Bb", "bAc"), Descriptor = c("Jog/ Stop/ Turn", "Change/ Shuffle/ Backwards/ Jump", "Turn/ Duck", "Duck/ Jog/ Change"), Person = c("Sally", "Sally", "Ben", "Ben")), .Names = c("Strings", "Descriptor", "Person"), row.names = c(NA, 4L), class = "data.frame")

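For each Person, I want to capture the frequency of each letter (for example: A, a, B, b) across all of their Strings. There are 48 upper- and lower-case letters. My actual data.frame contains the movements of more than 100 individuals, so a quick solution that iterates over each person would be ideal. As an example, my expected output is: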

Output <- structure(list(Person = c("Sally", "Sally", "Sally", "Sally", "Ben", "Ben", "Ben", "Ben"), Letter = c("A", "a", "B", "b", "A", "a", "B", "b"), Frequency = c(1, 1, 1, 0, 1, 0, 1, 2)), .Names = c("Person", "Letter", "Frequency"), row.names = c(NA, 8L), class = "data.frame")

Thank you!

Answer 1

One option is to use data.table:

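library(data.table)
# Work on a data.table copy of the question's MovementAnalysis
df2 <- as.data.table(MovementAnalysis)[, list(Letter = {
   # Split every string in the group into single characters
   tmp <- unlist(strsplit(Strings, ''))
   # Keep only the letters of interest, as a factor so zero counts survive
   factor(tmp[tmp %in% c("A", "a", "B", "b")],
        levels = c("A", "a", "B", "b"))}), Person]
# Constant column that becomes the name of the count column in dcast
df2[, ind := "Frequency"]
# drop=FALSE keeps Person/Letter combinations that never occur
dcast(df2, Person+Letter~ind, value.var="Letter", length, drop=FALSE)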
#   Person Letter Frequency
#1:    Ben      A         1
#2:    Ben      a         0
#3:    Ben      B         1
#4:    Ben      b         2
#5:  Sally      A         1
#6:  Sally      a         1
#7:  Sally      B         1
#8:  Sally      b         0
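
For comparison, here is a minimal base R sketch of the same count. It is not part of the answer above; it swaps in table() and as.data.frame() for the data.table steps, and the names chars, long and counts are made up for illustration. It assumes the Strings and Person columns from the question.

```r
# Rebuild the two relevant columns from the question
MovementAnalysis <- data.frame(
  Strings = c("AaB", "cZhH", "Bb", "bAc"),
  Person  = c("Sally", "Sally", "Ben", "Ben"),
  stringsAsFactors = FALSE
)

# One character vector per row
chars <- strsplit(MovementAnalysis$Strings, "")

# One row per character; letters outside the levels become NA
long <- data.frame(
  Person = rep(MovementAnalysis$Person, lengths(chars)),
  Letter = factor(unlist(chars), levels = c("A", "a", "B", "b"))
)

# Cross-tabulate Person by Letter; NA letters are ignored by default,
# and zero-count combinations are kept because Letter is a factor
counts <- as.data.frame(table(long), responseName = "Frequency")
counts
```

Because Letter is a factor with fixed levels, combinations such as Sally/b still appear with a count of 0, which is what the expected output in the question asks for.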

Answer 2

Not as magical as akrun's answer, but I think it works:

your.func <- function(data) {
    require(dplyr)
    bag.of.letters <- function(strings) {
        concat.string <- paste(strings, collapse='')
        all.chars.vec <- unlist(strsplit(concat.string,""))
        result <- data.frame(table(factor(all.chars.vec,levels = c(letters,LETTERS))))
        colnames(result) <- c("Letter","Frequency")
        result[order(result[["Letter"]]),]
    }
    lapply(X = unique(data[["Person"]]), 
           FUN = function(n) {
               strings = data %>% filter(Person == n) %>% .[["Strings"]]
               data.frame(Person = n, bag.of.letters(strings))
           }) %>% do.call(rbind,.)
}

your.func(MovementAnalysis)

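If you only want the Letter column to include letters with a positive frequency, remove the factor(..., levels = c(letters,LETTERS)) part, so that table() is called on all.chars.vec directly.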
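
To make the difference concrete, here is a small sketch using Ben's two strings from the question. The variable names chars, with_levels and observed_only are made up for illustration.

```r
# Ben's strings from the question, split into single characters
chars <- unlist(strsplit(c("Bb", "bAc"), ""))

# With fixed levels: every letter gets an entry, including zero counts
with_levels <- table(factor(chars, levels = c(letters, LETTERS)))

# Without levels: only the letters that actually occur are tabulated
observed_only <- table(chars)

length(with_levels)    # one entry per letter in c(letters, LETTERS)
length(observed_only)  # one entry per distinct observed letter
```

The first form is what keeps rows such as "a" with a frequency of 0 in the result; the second drops them.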

Answer 3

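Here's an option using cSplit_e from my "splitstackshape" package. I've combined it with "magrittr" so you can walk through the steps without having to store any intermediate objects or write long nested expressions.

The first option shows how to get the "wide" form, as mentioned by @alistaire.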

library(splitstackshape)
library(magrittr)
library(data.table)  # for data.table() and melt() used below

data.table(subset(MovementAnalysis, select = -Descriptor)) %>%
  cSplit_e("Strings", "", type = "character", drop = TRUE, fill = 0) %>%
  .[, lapply(.SD, sum), by = Person] %>%
  subset(select = grep("Person|_[AaBb]$", names(.)))
#    Person Strings_a Strings_A Strings_b Strings_B
# 1:  Sally         1         1         0         1
# 2:    Ben         0         1         2         1

To go from the above to the long form, you just need to add a melt line.

data.table(subset(MovementAnalysis, select = -Descriptor)) %>%
  cSplit_e("Strings", "", type = "character", drop = TRUE, fill = 0) %>%
  .[, lapply(.SD, sum), by = Person] %>%
  subset(select = grep("Person|_[AaBb]$", names(.))) %>%
  melt(id.vars = "Person")
#    Person  variable value
# 1:  Sally Strings_a     1
# 2:    Ben Strings_a     0
# 3:  Sally Strings_A     1
# 4:    Ben Strings_A     1
# 5:  Sally Strings_b     0
# 6:    Ben Strings_b     2
# 7:  Sally Strings_B     1
# 8:    Ben Strings_B     1
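
The variable column above keeps the Strings_ prefix. If a plain Letter column is wanted, as in the question's expected output, one possible follow-up is to strip the prefix with base sub(). This is only a sketch: the small data frame below is a hand-typed stand-in for two rows of the melted result, not output from the pipeline.

```r
# Hand-typed stand-in for part of the melted output shown above
long <- data.frame(
  Person   = c("Sally", "Ben", "Sally", "Ben"),
  variable = c("Strings_a", "Strings_a", "Strings_B", "Strings_B"),
  value    = c(1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Remove the "Strings_" prefix to leave just the letter
long$Letter <- sub("^Strings_", "", long$variable)
long
```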

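It's not clear from your question, but if you restricted the data to "A", "a", "B", and "b" just for illustration and you are actually interested in the full 48 options, then you can also omit the following line: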

subset(select = grep("Person|_[AaBb]$", names(.)))
