R: How to aggregate a dataframe using an index column?
I have a data frame that looks like this:
head(test_df, n = 15)
# print the first 15 rows of the data frame
               value frequency index
1  -2.90267705917358         1     1
2  -2.90254878997803         1     1
3  -2.90252590179443         1     1
4  -2.90219354629517         1     1
5  -2.90201354026794         1     1
6   -2.9016375541687         1     1
7  -2.90107154846191         1     1
8  -2.90089440345764         1     1
9  -2.89996957778931         1     1
10 -2.89970088005066         1     1
11 -2.89928865432739         1     2
12 -2.89920520782471         1     2
13 -2.89907360076904         1     2
14 -2.89888191223145         1     2
15  -2.8988630771637         1     2
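The data frame has 3 columns and 61819 rows. To aggregate it, I want the mean of the 'value' and 'frequency' columns over all rows that share the same 'index'.
I have already found some useful links, see: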
https://www.r-bloggers.com/2018/07/how-to-aggregate-data-in-r/
However, I have not solved the problem yet.
test_df_ag <- stats::aggregate(test_df[1:2], by = test_df[3], FUN = 'mean')
# aggregate the dataframe based on the 'index' column (build the mean)
   index value frequency
1      1    NA         1
2      2    NA         1
3      3    NA         1
4      4    NA         1
5      5    NA         1
6      6    NA         1
7      7    NA         1
8      8    NA         1
9      9    NA         1
10    10    NA         1
11    11    NA         1
12    12    NA         1
13    13    NA         1
14    14    NA         1
15    15    NA         1
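Since I only get NA values for the 'value' column, I wonder whether this might simply be a data type problem. But when I tried to convert the data type, I failed...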
base::typeof(test_df$value)
# query the data type of the 'value' column
[1] "integer"
Try the tidyverse:
library(dplyr)
test_summary <- test_df %>%
  group_by(index) %>%
  summarise(n = n(),
            mean_value = mean(value, na.rm = TRUE),
            mean_frequency = mean(frequency, na.rm = TRUE))
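Oh, and of course you should make sure you have checked the quality of your data and understand the assumptions behind, and the reasons for, any NAs in the dataset.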
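One possible explanation for the NA column, offered as an assumption rather than a diagnosis of the original data: typeof() reporting "integer" for a column that prints decimal numbers is what a factor looks like, and mean() of a factor returns NA with a warning. The sketch below uses a small made-up data frame (df and its values are hypothetical, not the 61819-row data) to show the check and a conversion through character.

```r
# Hypothetical data frame in which 'value' was read in as a factor
df <- data.frame(
  value     = factor(c("-2.9027", "-2.9025", "-2.8993")),
  frequency = c(1L, 1L, 1L),
  index     = c(1L, 1L, 2L)
)

typeof(df$value)  # "integer": the storage type of a factor
class(df$value)   # "factor"

# mean() of a factor is NA (R also emits a warning, suppressed here)
na_mean <- suppressWarnings(mean(df$value))

# Convert via character so the printed numbers are recovered,
# not the internal integer level codes
df$value <- as.numeric(as.character(df$value))

# The by-list call from the question, applied to the converted column
result <- aggregate(df[1:2], by = df[3], FUN = mean)
result
```

If class() reports something other than "factor" on the real data, this assumption does not hold and the cause of the NAs lies elsewhere.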
1. Here is a base R solution.
aggregate(cbind(value, frequency) ~ index, data = test_df, FUN = mean)
# index value frequency
#1 1 -2.901523 1
#2 2 -2.899062 1
2. And a simple dplyr solution.
library(dplyr)
test_df %>%
group_by(index) %>%
summarize(across(1:2, mean))
## A tibble: 2 x 3
# index value frequency
#* <int> <dbl> <dbl>
#1 1 -2.90 1
#2 2 -2.90 1
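As a variation on the dplyr solution, the columns can be selected by name rather than by position, which keeps working if the column order changes. This is only a sketch on a small made-up data frame (demo_df and its values are hypothetical); the na.rm = TRUE and the row count n are optional additions, not part of the answer above.

```r
library(dplyr)

# Hypothetical stand-in for test_df
demo_df <- data.frame(
  value     = c(-2.90, -2.92, -2.88, -2.86),
  frequency = c(1L, 1L, 1L, 1L),
  index     = c(1L, 1L, 2L, 2L)
)

res <- demo_df %>%
  group_by(index) %>%
  summarise(
    # mean of each named column within the group, ignoring NAs
    across(c(value, frequency), ~ mean(.x, na.rm = TRUE)),
    n = n(),           # number of rows per group
    .groups = "drop"   # return an ungrouped tibble
  )
res
```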
Data
test_df <-
structure(list(value = c(-2.90267705917358, -2.90254878997803,
-2.90252590179443, -2.90219354629517, -2.90201354026794, -2.9016375541687,
-2.90107154846191, -2.90089440345764, -2.89996957778931, -2.89970088005066,
-2.89928865432739, -2.89920520782471, -2.89907360076904, -2.89888191223145,
-2.8988630771637), frequency = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), index = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"))
Using data.table
library(data.table)
setDT(test_df)[, lapply(.SD, mean), by = index, .SDcols = 1:2]
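Note that setDT() converts test_df to a data.table by reference. If the original data frame should stay untouched, a copy via as.data.table() is one alternative. The sketch below assumes a small made-up data frame (demo_df and dt are hypothetical names) and selects the columns by name instead of by position.

```r
library(data.table)

# Hypothetical stand-in for test_df
demo_df <- data.frame(
  value     = c(-2.90, -2.92, -2.88, -2.86),
  frequency = c(1L, 1L, 1L, 1L),
  index     = c(1L, 1L, 2L, 2L)
)

# as.data.table() returns a copy; demo_df remains a plain data.frame
dt <- as.data.table(demo_df)

# Mean of the named columns within each index group
res <- dt[, lapply(.SD, mean), by = index, .SDcols = c("value", "frequency")]
res
```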