找到最常出现的值并指示相对频率

Question

我观察到 2 位调查参与者的 12 条回复。

data = data.frame(id = c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2), response = c(2,2,3,3,6,3,6,7,3,1,4,3,3,3,6,4,2,6,7,3,2,1,5,6))
data
   id response
1   1        2
2   1        2
3   1        3
4   1        3
5   1        6
6   1        3
7   1        6
8   1        7
9   1        3
10  1        1
11  1        4
12  1        3
13  2        3
14  2        3
15  2        6
16  2        4
17  2        2
18  2        6
19  2        7
20  2        3
21  2        2
22  2        1
23  2        5
24  2        6

现在我想在每个调查参与者的数据中添加两件事： a) 本次调查参与者出现频率最高的值 b) 最频繁值的相对频率

如何使用 dplyr 添加这些东西：

data %>%
  group_by(id) %>%
  mutate(most_frequent_value = ?,
         relative_frequency_of_most_frequent_value = ?)

Answer 1

你可以试试：

table(data$id, data$response) %>%
  as.data.frame() %>%
  setNames(c("id", "response", "n")) %>%
  group_by(id) %>%
  slice_max(n, 1) %>%
  group_by(response) %>%
  filter(n() > 1) %>%
  mutate(ratio = c(n[1]/sum(n), n[2]/sum(n)))

#> # A tibble: 2 x 4
#> # Groups:   response [1]
#>   id    response     n ratio
#>   <fct> <fct>    <int> <dbl>
#> 1 1     3            5 0.625
#> 2 2     3            3 0.375

Answer 2

我可能会使用两步解决方案。首先，创建 frequency/relative 频率的 data.frame。然后加入它。我们使用slice(which.max())，因为它会return一行。使用 slice_max 可能 return 多行。

library(tidyverse)
# count by id, response, calculate rel frequency
# rename columns to make inner_join easier
freq_table <- dd %>%
  count(id, response) %>%
  group_by(id) %>%
  mutate(rel_freq = n / sum(n)) %>%
  select(id, most_frequent_response = response, rel_freq)

# inner join to sliced freq_table (grouping by id is preserved)
dd %>%
  inner_join(freq_table %>% slice(which.max(rel_freq)))

#    id response most_frequent_response  rel_freq
# 1   1        2                      3 0.4166667
# 2   1        2                      3 0.4166667
# 3   1        3                      3 0.4166667
# 4   1        3                      3 0.4166667
# 5   1        6                      3 0.4166667
# ...

Answer 3

这个有用吗：

data %>% group_by(id, response) %>% mutate(n = n()) %>% 
ungroup() %>% group_by(id) %>% 
mutate(most_frequent_value = response[n == max(n)][1], 
relative_frequency_of_most_frequent_value = max(n)/n())
# A tibble: 24 x 5
# Groups:   id [2]
      id response     n most_frequent_value relative_frequency_of_most_frequent_value
   <dbl>    <dbl> <int>               <dbl>                                     <dbl>
 1     1        2     2                   3                                     0.417
 2     1        2     2                   3                                     0.417
 3     1        3     5                   3                                     0.417
 4     1        3     5                   3                                     0.417
 5     1        6     2                   3                                     0.417
 6     1        3     5                   3                                     0.417
 7     1        6     2                   3                                     0.417
 8     1        7     1                   3                                     0.417
 9     1        3     5                   3                                     0.417
10     1        1     1                   3                                     0.417
11     1        4     1                   3                                     0.417
12     1        3     5                   3                                     0.417
13     2        3     3                   3                                     0.25 
14     2        3     3                   3                                     0.25 
15     2        6     3                   3                                     0.25 
16     2        4     1                   3                                     0.25 
17     2        2     2                   3                                     0.25 
18     2        6     3                   3                                     0.25 
19     2        7     1                   3                                     0.25 
20     2        3     3                   3                                     0.25 
21     2        2     2                   3                                     0.25 
22     2        1     1                   3                                     0.25 
23     2        5     1                   3                                     0.25 
24     2        6     3                   3                                     0.25 
>

找到最常出现的值并指示相对频率

Find the value that occurs most frequently and indicate relative frequency

r

transform

pipe

count

dplyr