How to summarize data by factor level in R
I have the following data, and I want to summarize it (min/max/mean/median/mode/sd) by the levels of a factor, i.e. the cluster.kmeans column.
head(MS.DATA.IMPVAR.KMEANS,10)
   subscribers      arpu handset3g       mou rechargesum cluster.kmeans
1       105822 197704.10     19040 2854801.0      235430              5
2        18210  34799.21      2856  419109.0       39820              6
3        71351 133842.38     13056 2021183.0      157099              3
4        44975 104681.58      9439 1303220.6      121697              2
5        75860 133190.55     12605 1714640.8      144262              5
6        63740 119389.91     11067 1651303.2      143333              1
7        59368 117792.03     11747 1690910.7      136902              5
8        40064  80427.09      7217  886214.5       89226              2
9        51966  99385.52      9972 1407985.7      117353              5
10       70811 141131.66     12362 1373104.7      158206              4
I tried using dplyr, with the following result:
s_kmeans <- MS.DATA.IMPVAR.KMEANS %>% group_by(cluster.kmeans) %>% summarise_all(c("mean", "median", "min", "max", "sd"))
s_kmeans <- gather(s_kmeans, key, value, -cluster.kmeans)
s_kmeans$variable <- sapply(strsplit(s_kmeans$key, "_"), `[`,1)
s_kmeans$stat <- sapply(strsplit(s_kmeans$key, "_"), `[`, 2)
MS.DATA.STATS.KMEANS <- select(s_kmeans, -key) %>% spread(key = stat, value = value)
head(MS.DATA.STATS.KMEANS)
# A tibble: 6 × 7
  cluster.kmeans variable          max      mean    median       min
          <fctr> <chr>           <dbl>     <dbl>     <dbl>     <dbl>
1              1 arpu         250153.5 164652.99 163718.33  88306.53
2              1 handset3g     21809.0  13736.38  13598.00   6936.00
3              1 mou         1143639.1 338834.54 313010.20 116523.59
4              1 rechargesum  270169.0 173397.03 171897.00  89080.00
5              1 subscribers   41428.0  26515.01  26321.00  13794.00
6              2 arpu         163566.9  84552.09  82402.23  29477.03
I would like to do this some other way, without dplyr and in fewer lines of code, using base R functions such as by, aggregate, etc.
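It is not clear whether fewer lines of code or base R is the priority. However, staying with the current Hadleyverse approach, we can put the whole thing in a single %>% chain and use separate in place of the two sapply steps to make it more compact: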
library(dplyr)
library(tidyr)
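MS.DATA.IMPVAR.KMEANS %>%
  group_by(cluster.kmeans) %>%
  # named list instead of funs(), which is deprecated since dplyr 0.8.0
  summarise_all(list(mean = mean, median = median, min = min, max = max, sd = sd)) %>%
  gather(key, value, -cluster.kmeans) %>%
  # split names such as "arpu_mean" into the variable and the statistic
  separate(key, into = c("variable", "stats")) %>%
  spread(stats, value)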
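Since the question also asks for base R, here is a rough sketch of one way to get the same long layout with stack and aggregate. This is not part of the pipeline above, just an alternative under some assumptions: the data frame below is the ten rows from the question re-entered by hand, and the helper stats and the intermediate names long, res and out are made up for illustration.

```r
# Ten sample rows from the question, re-entered by hand (assumption: the
# real data has the same columns, with cluster.kmeans stored as a factor)
MS.DATA.IMPVAR.KMEANS <- data.frame(
  subscribers = c(105822, 18210, 71351, 44975, 75860,
                  63740, 59368, 40064, 51966, 70811),
  arpu        = c(197704.10, 34799.21, 133842.38, 104681.58, 133190.55,
                  119389.91, 117792.03, 80427.09, 99385.52, 141131.66),
  handset3g   = c(19040, 2856, 13056, 9439, 12605,
                  11067, 11747, 7217, 9972, 12362),
  mou         = c(2854801.0, 419109.0, 2021183.0, 1303220.6, 1714640.8,
                  1651303.2, 1690910.7, 886214.5, 1407985.7, 1373104.7),
  rechargesum = c(235430, 39820, 157099, 121697, 144262,
                  143333, 136902, 89226, 117353, 158206),
  cluster.kmeans = factor(c(5, 6, 3, 2, 5, 1, 5, 2, 5, 4))
)

# Hypothetical helper: all requested statistics as one named vector
stats <- function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x), sd = sd(x))
}

# Stack the numeric columns into long form (columns `values` and `ind`);
# stack() goes column by column, so the cluster labels are repeated once per column
num_cols <- setdiff(names(MS.DATA.IMPVAR.KMEANS), "cluster.kmeans")
long <- data.frame(
  cluster.kmeans = rep(MS.DATA.IMPVAR.KMEANS$cluster.kmeans,
                       times = length(num_cols)),
  stack(MS.DATA.IMPVAR.KMEANS[num_cols])
)

# One aggregate() call per cluster/variable pair; because stats() returns a
# vector, the `values` column of the result is a matrix with one column per statistic
res <- aggregate(values ~ cluster.kmeans + ind, data = long, FUN = stats)

# Flatten the matrix column into ordinary columns and sort like the dplyr output
out <- data.frame(
  cluster.kmeans = res$cluster.kmeans,
  variable = as.character(res$ind),
  res$values
)
out <- out[order(out$cluster.kmeans, out$variable), ]
rownames(out) <- NULL
head(out)
```

With only these ten rows, any cluster that has a single observation gets NA for sd, which is expected. The mode asked about in the question is left out here because base R has no built-in statistical mode function; it would need its own helper added to stats.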