How to summarize data by factor levels in R

Question

I have the following data and I want to summarize it (min/max/mean/median/mode/sd) by the levels of a factor, i.e. the cluster.kmeans column:

head(MS.DATA.IMPVAR.KMEANS,10)
     subscribers   arpu     handset3g    mou     rechargesum  cluster.kmeans
 1       105822 197704.10     19040 2854801.0      235430              5
 2        18210  34799.21      2856  419109.0       39820              6
 3        71351 133842.38     13056 2021183.0      157099              3
 4        44975 104681.58      9439 1303220.6      121697              2
 5        75860 133190.55     12605 1714640.8      144262              5
 6        63740 119389.91     11067 1651303.2      143333              1
 7        59368 117792.03     11747 1690910.7      136902              5
 8        40064  80427.09      7217  886214.5       89226              2
 9        51966  99385.52      9972 1407985.7      117353              5
 10       70811 141131.66     12362 1373104.7      158206              4

I tried using dplyr, with the following result:

s_kmeans <- MS.DATA.IMPVAR.KMEANS %>% group_by(cluster.kmeans) %>% summarise_all(c("mean", "median", "min", "max", "sd"))
s_kmeans <- gather(s_kmeans, key, value, -cluster.kmeans)   
s_kmeans$variable <- sapply(strsplit(s_kmeans$key, "_"), `[`,1)    
s_kmeans$stat <- sapply(strsplit(s_kmeans$key, "_"), `[`, 2)    
MS.DATA.STATS.KMEANS <- select(s_kmeans, -key) %>% spread(key = stat, value = value)

head(MS.DATA.STATS.KMEANS)
# A tibble: 6 × 7
   cluster.kmeans    variable       max      mean    median       min
           <fctr>       <chr>     <dbl>     <dbl>     <dbl>     <dbl>
 1              1        arpu  250153.5 164652.99 163718.33  88306.53
 2              1   handset3g   21809.0  13736.38  13598.00   6936.00
 3              1         mou 1143639.1 338834.54 313010.20 116523.59
 4              1 rechargesum  270169.0 173397.03 171897.00  89080.00
 5              1 subscribers   41428.0  26515.01  26321.00  13794.00
 6              2        arpu  163566.9  84552.09  82402.23  29477.03

I would like to do this another way, without dplyr and with fewer lines of code, using base R functions such as by, aggregate, etc.

Answer 1

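It is not clear whether fewer lines of code or base R is the priority. Staying with the current Hadleyverse approach, though, we can chain everything into a single %>% pipeline and use separate in place of the two sapply steps to make the code more compact: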

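library(dplyr)
library(tidyr)
MS.DATA.IMPVAR.KMEANS %>%
    group_by(cluster.kmeans) %>%
    # funs() is deprecated since dplyr 0.8.0; list(mean = mean, ...) is the newer form
    summarise_all(funs(mean, median, min, max, sd)) %>%
    # wide to long: one row per cluster and "variable_stat" key
    gather(key, value, -cluster.kmeans) %>%
    # split e.g. "arpu_mean" into variable = "arpu" and stats = "mean"
    separate(key, into = c("variable", "stats"), sep = "_") %>%
    # back to wide: one column per statistic
    spread(stats, value)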
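
If base R is the priority, something along the following lines might work. This is only a sketch using aggregate with a function that returns several statistics at once. The data frame df is a made-up stand-in for MS.DATA.IMPVAR.KMEANS, and stat_mode and stats are hypothetical helper names introduced here, since base R has no built-in function for the statistical mode.

```r
# Toy stand-in for MS.DATA.IMPVAR.KMEANS (values are made up for illustration)
df <- data.frame(
  subscribers    = c(10, 20, 30, 40, 50, 60),
  arpu           = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5),
  cluster.kmeans = factor(c(1, 1, 1, 2, 2, 2))
)

# Hypothetical helper: most frequent value (the first one wins on ties)
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: all statistics for one numeric vector, as a named vector
stats <- function(x) {
  c(mean = mean(x), median = median(x), min = min(x),
    max = max(x), sd = sd(x), mode = stat_mode(x))
}

# `. ~ cluster.kmeans` summarises every other column within each factor level
res <- aggregate(. ~ cluster.kmeans, data = df, FUN = stats)

# aggregate() stores each variable's statistics as a matrix column;
# do.call(data.frame, ...) flattens them into columns such as arpu.mean
res <- do.call(data.frame, res)
print(res)
```

The result is wide (one row per cluster, one column per variable and statistic) rather than long like the tibble above. Note that the formula interface of aggregate drops rows containing NA by default (na.action = na.omit), and that a mode is of limited use for continuous columns where most values are unique.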

