How to calculate the monthly average of multiple columns in a data frame based on a Date column with missing data in R
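My data frame has a large number of columns, more than 4000. One column is the date and the rest are companies (the column names). I have 14 years of daily observations (as rows), which makes 164 months. I want to calculate monthly averages based on the Date column and, most importantly, calculate the average for a column (company) only when it has at least 15 observations in that month; otherwise it should return NA.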
df<- Spread
Date A B C
2000-01-04 0.062893082 0.030769231 NA
2000-01-05 0.062893082 0.015503876 NA
2000-01-06 0.062893082 NA NA
2000-01-07 0.062893082 NA NA
2000-01-10 0.062893082 NA NA
2000-01-11 0.062893082 NA NA
2000-01-12 0.062893082 NA NA
2000-01-13 0.062893082 NA NA
2000-01-14 0.062893082 NA NA
2000-01-17 0.052910053 NA NA
2000-01-18 0.031413613 NA NA
2000-01-19 0.052910053 NA NA
2000-01-20 0.051282051 NA NA
2000-01-21 0.051282051 0.014184397 NA
2000-01-24 0.051282051 0.014184397 NA
2000-01-25 0.051282051 0.014184397 NA
2000-01-26 0.051282051 0.014184397 NA
2000-01-27 0.051282051 0.019914651 NA
2000-01-28 0.031088083 0.028571429 NA
2000-01-31 0.031088083 0.028571429 NA
Desired output
Monthly<- df
Month A B C
Jan-2000 0.053656996 NA NA
Thank you very much for your help. I would also like to round the values to 4 decimal places, e.g. 0.062893082 to 0.0628.
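We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)), then use format to extract the month-year (after converting to Date class). This serves as the grouping variable. We loop over the columns (lapply(.SD, ...)) and, if the length of the non-NA elements is greater than or equal to 15, get the mean, or else return NA.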
library(data.table)

# Group by month-year; return the mean only when a column has >= 15 non-NA values
setDT(df1)[, lapply(.SD, function(x) if (length(na.omit(x)) >= 15)
                      mean(x, na.rm = TRUE) else NA_real_),
           by = .(Month = format(as.IDate(Date), '%b-%Y'))]
#      Month        A  B  C
#1: Jan-2000 0.053657 NA NA
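The question also asks for four decimal places, which the code above does not cover. A minimal sketch, assuming the monthly result is held in a data frame; the name monthly and its one-row contents are hypothetical, filled in from the January 2000 output. Note that round() rounds to the nearest value, so 0.062893082 becomes 0.0629; the 0.0628 given in the question corresponds to truncation, shown as the second variant.

```r
# Hypothetical one-row result, using the January 2000 values from the output
monthly <- data.frame(Month = "Jan-2000",
                      A = 0.053656996, B = NA_real_, C = NA_real_)

# Only touch the numeric columns; Month is left alone
num_cols <- sapply(monthly, is.numeric)

# Variant 1: round to the nearest 4th decimal place
rounded <- monthly
rounded[num_cols] <- lapply(monthly[num_cols], round, digits = 4)

# Variant 2: cut off after the 4th decimal place (truncation)
truncated <- monthly
truncated[num_cols] <- lapply(monthly[num_cols],
                              function(x) trunc(x * 1e4) / 1e4)

rounded$A     # 0.0537
truncated$A   # 0.0536
```

NA values pass through both variants unchanged, so columns that failed the 15-observation rule stay NA.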
The monthly averages can be computed similarly with dplyr:
library(dplyr)

# Same rule per column: mean only when the month has at least 15 non-NA values
df1 %>%
  group_by(Month = format(as.Date(Date), '%b-%Y')) %>%
  summarise_each(funs(if (length(na.omit(.)) >= 15)
                        mean(., na.rm = TRUE) else NA_real_), A:C)
#     Month        A     B     C
#     (chr)    (dbl) (dbl) (dbl)
#1 Jan-2000 0.053657    NA    NA
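summarise_each() and funs() have since been deprecated in dplyr. A sketch of the same idea with across(), assuming dplyr 1.0.0 or later. The helper monthly_mean, its min_obs argument, and the toy data are made up for illustration and are not part of the original answer.

```r
library(dplyr)

# Made-up data: 16 consecutive days, all in January 2000
toy <- data.frame(
  Date = as.character(seq(as.Date("2000-01-03"), by = "day", length.out = 16)),
  A = rep(0.05, 16),                  # 16 non-NA values, passes the threshold
  B = c(rep(0.02, 10), rep(NA, 6))    # 10 non-NA values, fails the threshold
)

# Hypothetical helper: mean only if there are at least `min_obs` non-NA values
monthly_mean <- function(x, min_obs = 15) {
  if (sum(!is.na(x)) >= min_obs) mean(x, na.rm = TRUE) else NA_real_
}

result <- toy %>%
  group_by(Month = format(as.Date(Date), "%b-%Y")) %>%
  # -Date selects every company column, so 4000 columns need no listing
  summarise(across(-Date, monthly_mean), .groups = "drop")

result
```

Selecting with -Date instead of A:C avoids naming the first and last company column, which may be convenient with several thousand columns.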
Data
df1 <- structure(list(Date = c("2000-01-04", "2000-01-05", "2000-01-06",
"2000-01-07", "2000-01-10", "2000-01-11", "2000-01-12", "2000-01-13",
"2000-01-14", "2000-01-17", "2000-01-18", "2000-01-19", "2000-01-20",
"2000-01-21", "2000-01-24", "2000-01-25", "2000-01-26", "2000-01-27",
"2000-01-28", "2000-01-31"), A = c(0.062893082, 0.062893082,
0.062893082, 0.062893082, 0.062893082, 0.062893082, 0.062893082,
0.062893082, 0.062893082, 0.052910053, 0.031413613, 0.052910053,
0.051282051, 0.051282051, 0.051282051, 0.051282051, 0.051282051,
0.051282051, 0.031088083, 0.031088083), B = c(0.030769231, 0.015503876,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.014184397, 0.014184397,
0.014184397, 0.014184397, 0.019914651, 0.028571429, 0.028571429
), C = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA)), .Names = c("Date", "A", "B", "C"
), class = "data.frame", row.names = c(NA, -20L))
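As a quick sanity check on the sample data, the sketch below counts the non-NA observations per column, which shows why only A gets a monthly mean. The vectors are rebuilt compactly with rep() from the values listed in the question so that the snippet runs on its own; the object names are illustrative.

```r
# Sample columns rebuilt from the question's values (20 daily rows each)
A <- c(rep(0.062893082, 9), 0.052910053, 0.031413613, 0.052910053,
       rep(0.051282051, 6), rep(0.031088083, 2))
B <- c(0.030769231, 0.015503876, rep(NA, 11),
       rep(0.014184397, 4), 0.019914651, rep(0.028571429, 2))
C <- rep(NA_real_, 20)

# Number of non-NA observations per column
counts <- colSums(!is.na(data.frame(A, B, C)))
counts
#  A  B  C
# 20  9  0

# Only A reaches the threshold of 15, so only A has a monthly mean
round(mean(A), 6)   # 0.053657
```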