Hourly mean of multiple variables in R data.frame?
I have the following code and am trying to find the hourly mean of each variable (i.e., X, Y, and Z). My output should be a data.frame with an hourlyDate column and the mean hourly data of all variables. Any way forward would be appreciated.
library(lubridate)
set.seed(123)
T <- data.frame(Datetime = seq(ymd_hms("2011-01-01 00:00:00"), to= ymd_hms("2011-12-31 00:00:00"), by = "5 min"),
X = runif(104833, 5,10),Y = runif(104833, 5,10), Z = runif(104833, 5,10))
T$Date <- format(T$Datetime, format="%Y-%m-%d")
T$Hour <- format(T$Datetime, format = "%H")
T$Mints <- format(T$Datetime, format = "%M")
Here is a tidyverse approach:
library(dplyr)
group_by(T, Date, Hour) %>%
summarize(X = mean(X), Y = mean(Y), Z = mean(Z)) %>%
transmute(Date = as.POSIXct(paste0(Date, " ", Hour, ":00:00")), X, Y, Z)
#> # A tibble: 8,737 x 4
#> # Groups: Date [365]
#> Date X Y Z
#> <dttm> <dbl> <dbl> <dbl>
#> 1 2011-01-01 00:00:00 8.00 7.90 6.90
#> 2 2011-01-01 01:00:00 7.93 7.47 7.90
#> 3 2011-01-01 02:00:00 7.83 6.89 7.67
#> 4 2011-01-01 03:00:00 6.61 7.92 7.18
#> 5 2011-01-01 04:00:00 7.27 7.20 6.48
#> 6 2011-01-01 05:00:00 7.88 6.80 7.69
#> 7 2011-01-01 06:00:00 7.07 8.05 7.52
#> 8 2011-01-01 07:00:00 7.40 7.92 6.99
#> 9 2011-01-01 08:00:00 7.97 7.76 7.26
#> 10 2011-01-01 09:00:00 7.57 7.47 6.94
#> # ... with 8,727 more rows
Try:
library(lubridate)
library(dplyr)
set.seed(123)
T <- data.frame(Datetime = seq(ymd_hms("2011-01-01 00:00:00"), to= ymd_hms("2011-12-31 00:00:00"), by = "5 min"),
X = runif(104833, 5,10),Y = runif(104833, 5,10), Z = runif(104833, 5,10))
T %>% mutate(hourlyDate = floor_date(Datetime,unit='hour')) %>%
select(-Datetime) %>% group_by(hourlyDate) %>%
summarize(across(everything(),mean)) %>%
ungroup()
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 8,737 x 4
#> hourlyDate X Y Z
#> <dttm> <dbl> <dbl> <dbl>
#> 1 2011-01-01 00:00:00 8.00 7.90 6.90
#> 2 2011-01-01 01:00:00 7.93 7.47 7.90
#> 3 2011-01-01 02:00:00 7.83 6.89 7.67
#> 4 2011-01-01 03:00:00 6.61 7.92 7.18
#> 5 2011-01-01 04:00:00 7.27 7.20 6.48
#> 6 2011-01-01 05:00:00 7.88 6.80 7.69
#> 7 2011-01-01 06:00:00 7.07 8.05 7.52
#> 8 2011-01-01 07:00:00 7.40 7.92 6.99
#> 9 2011-01-01 08:00:00 7.97 7.76 7.26
#> 10 2011-01-01 09:00:00 7.57 7.47 6.94
#> # ... with 8,727 more rows
Created on 2020-08-20 by the reprex package (v0.3.0)
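The lubridate library has a floor_date function that truncates your datetime column to a specified unit. Then simply summarise the variables you want by the hourly timestamp: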
library(dplyr)
library(lubridate)
T %>%
group_by(hourlyDate = lubridate::floor_date(Datetime, unit = 'hours')) %>%
summarise(across(.cols = c(X,Y,Z), .fns = ~mean(.x, na.rm=TRUE), .names = "meanHourlyData_{.col}"))
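To make the truncation concrete, here is a minimal sketch on three made-up timestamps (the names stamps and floored are only for illustration). It shows floor_date mapping every timestamp within an hour to the start of that hour, which is what makes the result usable as a grouping key.

```r
library(lubridate)

# Three illustrative timestamps: two inside hour 00, one exactly at 01:00
stamps <- ymd_hms(c("2011-01-01 00:05:00",
                    "2011-01-01 00:55:00",
                    "2011-01-01 01:00:00"))

# Round each timestamp down to the start of its hour
floored <- floor_date(stamps, unit = "hour")

# Print the truncated timestamps as text
format(floored, "%Y-%m-%d %H:%M:%S")
```

The first two timestamps collapse to 2011-01-01 00:00:00 and the third stays at 2011-01-01 01:00:00, so grouping on the floored value puts all twelve 5-minute readings of an hour into one group.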
By the way, I'd advise against using T as a variable name, since it is also shorthand for TRUE and may lead to some unexpected behaviour...
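A small sketch of why this matters, assuming a fresh session in which T has not been reassigned yet (the result variable names are illustrative): T is an ordinary variable that can be overwritten, whereas TRUE is a reserved word.

```r
# T is an ordinary variable in base R whose default value is TRUE
default_is_true <- isTRUE(T)

# Assigning to T is legal and shadows the base definition
T <- data.frame(x = 1:3)
still_logical <- is.logical(T)   # FALSE while the data frame shadows base::T

# Removing the copy makes the base definition visible again
rm(T)
restored <- isTRUE(T)

# TRUE, by contrast, is a reserved word and cannot be reassigned
assign_attempt <- try(eval(parse(text = "TRUE <- 0")), silent = TRUE)
true_is_protected <- inherits(assign_attempt, "try-error")
```

Any code that relies on T meaning TRUE (for example an argument written as na.rm = T) would receive the data frame instead while the shadowing assignment is in place.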
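Three base R solutions are to use split, tapply, or rowsum combined with table. The last one is particularly fast (roughly 8.5 times faster than one of the dplyr answers, going by the median times).
tl;dr: you get the following computation times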
#R> Unit: milliseconds
#R> expr min lq mean median uq max neval
#R> split + sapply 563.9 577.4 636.1 649.8 680.7 697.1 10
#R> tapply + sapply 108.0 117.3 134.0 120.2 124.4 205.1 10
#R> rowsum + table 21.3 21.3 21.5 21.3 21.6 21.9 10
#R> dplyr 172.4 176.6 182.3 180.9 185.9 203.4 10
Here are the solutions:
# create date-hour column
T$DateH <- format(T$Datetime, format="%Y-%m-%d-%H")
# using split + sapply
options(digits = 3)
out_1 <- sapply(split(T[, c("X", "Y", "Z")], T$DateH), colMeans)
head(t(out_1), 5)
#R> X Y Z
#R> 2011-01-01-00 8.00 7.90 6.90
#R> 2011-01-01-01 7.93 7.47 7.90
#R> 2011-01-01-02 7.83 6.89 7.67
#R> 2011-01-01-03 6.61 7.92 7.18
#R> 2011-01-01-04 7.27 7.20 6.48
# using tapply + sapply
out_2 <- sapply(c("X", "Y", "Z"),
function(var) c(tapply(T[[var]], T$DateH, mean)))
head(out_2)
#R> X Y Z
#R> 2011-01-01-00 8.00 7.90 6.90
#R> 2011-01-01-01 7.93 7.47 7.90
#R> 2011-01-01-02 7.83 6.89 7.67
#R> 2011-01-01-03 6.61 7.92 7.18
#R> 2011-01-01-04 7.27 7.20 6.48
# check that we get the same
all.equal(t(out_1), out_2, check.attributes = FALSE)
#R> [1] TRUE
# with rowsum + table
out_3 <- as.matrix(rowsum(T[, c("X", "Y", "Z")], group = T$DateH)) /
rep(table(T$DateH), 3)
# check that we get the same
all.equal(out_2, out_3)
#R> [1] TRUE
# compare with dplyr solution
library(dplyr)
out_3 <- group_by(T, Date, Hour) %>%
summarize(X = mean(X), Y = mean(Y), Z = mean(Z)) %>%
transmute(Date = as.POSIXct(paste0(Date, " ", Hour, ":00:00")), X, Y, Z)
# check that we get the same
all.equal(out_2, as.matrix(out_3[, c("X", "Y", "Z")]),
check.attributes = FALSE)
#R> [1] TRUE
# check computation time
library(microbenchmark)
microbenchmark(
`split + sapply` =
sapply(split(T[, c("X", "Y", "Z")], T$DateH), colMeans),
`tapply + sapply` =
sapply(c("X", "Y", "Z"),
function(var) c(tapply(T[[var]], T$DateH, mean))),
`rowsum + table` =
as.matrix(rowsum(T[, c("X", "Y", "Z")], group = T$DateH)) /
rep(table(T$DateH), 3),
`dplyr` =
group_by(T, Date, Hour) %>%
summarize(X = mean(X), Y = mean(Y), Z = mean(Z)) %>%
transmute(Date = as.POSIXct(paste0(Date, " ", Hour, ":00:00")),
X, Y, Z), times = 10)
#R> Unit: milliseconds
#R> expr min lq mean median uq max neval
#R> split + sapply 563.9 577.4 636.1 649.8 680.7 697.1 10
#R> tapply + sapply 108.0 117.3 134.0 120.2 124.4 205.1 10
#R> rowsum + table 21.3 21.3 21.5 21.3 21.6 21.9 10
#R> dplyr 172.4 176.6 182.3 180.9 185.9 203.4 10
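The question asks for a data.frame with an hourlyDate column, while the base R results above are matrices keyed by row names. The following is a minimal sketch of one possible conversion, run on a small stand-in data frame (dat, three hours of readings) rather than the full data; the names dat, sums, counts, means and hourly are assumptions made for illustration.

```r
set.seed(123)
n <- 36  # three hours of 5-minute readings
dat <- data.frame(
  Datetime = seq(as.POSIXct("2011-01-01 00:00:00", tz = "UTC"),
                 by = "5 min", length.out = n),
  X = runif(n, 5, 10), Y = runif(n, 5, 10), Z = runif(n, 5, 10)
)
dat$DateH <- format(dat$Datetime, format = "%Y-%m-%d-%H")

# Per-hour sums divided by per-hour counts give per-hour means
sums <- as.matrix(rowsum(dat[, c("X", "Y", "Z")], group = dat$DateH))
counts <- as.vector(table(dat$DateH))
means <- sums / counts  # counts has one entry per row, recycled down each column

# Turn the row names back into timestamps and build the data.frame
hourly <- data.frame(
  hourlyDate = as.POSIXct(rownames(means), format = "%Y-%m-%d-%H", tz = "UTC"),
  means
)
rownames(hourly) <- NULL
hourly
```

Both rowsum and table return groups in sorted order, which is why the element-wise division lines up; the time zone is fixed to UTC here so that parsing the key back gives the same hours that format produced.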
I assume data.table could also get the result quickly. Finally, do not use T as a variable name. T is shorthand for TRUE!
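For reference, a data.table version might look like the sketch below. It is not part of the benchmark above and has not been timed, so it makes no speed claim; it uses a small stand-in data frame (dat) and illustrative names (DT, hourly), and borrows floor_date from lubridate for the hourly key.

```r
library(data.table)
library(lubridate)

set.seed(123)
n <- 36  # three hours of 5-minute readings
dat <- data.frame(
  Datetime = seq(ymd_hms("2011-01-01 00:00:00"), by = "5 min", length.out = n),
  X = runif(n, 5, 10), Y = runif(n, 5, 10), Z = runif(n, 5, 10)
)

DT <- as.data.table(dat)

# Group by the hour each timestamp falls in and average the chosen columns
hourly <- DT[, lapply(.SD, mean),
             by = .(hourlyDate = floor_date(Datetime, unit = "hour")),
             .SDcols = c("X", "Y", "Z")]
hourly
```

The result has one row per hour with columns hourlyDate, X, Y and Z; as.data.frame(hourly) converts it to a plain data.frame if that is required.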