Aggregate in R with a date column, grouped by the identifier column
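I want to aggregate (= summarize) my data by an id variable. After aggregating, however, the date column contains only NA, I think because it is set to "Date".
I want to keep the dates.
Data (first 10 obs):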
TUCASEID AGE MALE BLACK YEAR DATASET INTERVIEW_DAY INTERVIEW_DATE
1 2.00301e+13 60 1 1 2003 2003 5 03Jan2003
2 2.00301e+13 60 1 1 2003 2003 5 03Jan2003
3 2.00301e+13 60 1 1 2003 2003 5 03Jan2003
4 2.00301e+13 60 1 1 2003 2003 5 03Jan2003
5 2.00301e+13 60 1 1 2003 2003 5 03Jan2003
6 2.00301e+13 60 1 1 2003 2003 5 03Jan2003
7 2.00301e+13 60 1 1 2003 2003 5 03Jan2003
8 2.00301e+13 60 1 1 2003 2003 5 03Jan2003
9 2.00301e+13 60 1 1 2003 2003 5 03Jan2003
10 2.00301e+13 41 0 0 2003 2003 6 04Jan2003
Then I summarize with aggregate:
timeuse_2003_mean <- aggregate(timeuse_2003[,c("AGE","MALE","BLACK","YEAR","DATASET","INTERVIEW_DAY","INTERVIEW_DATE")],
by=list(timeuse_2003$TUCASEID),mean)
Here is the output:
TUCASEID AGE MALE BLACK YEAR DATASET INTERVIEW_DAY INTERVIEW_DATE
1 2.0030100e+13 60 1 1 2003 2003 5 NA
2 2.0030100e+13 41 0 0 2003 2003 6 NA
3 2.0030100e+13 26 0 0 2003 2003 6 NA
4 2.0030100e+13 36 0 1 2003 2003 4 NA
5 2.0030100e+13 51 1 0 2003 2003 4 NA
6 2.0030100e+13 32 0 0 2003 2003 4 NA
7 2.0030100e+13 44 0 0 2003 2003 1 NA
8 2.0030100e+13 21 0 0 2003 2003 2 NA
9 2.0030100e+13 33 0 0 2003 2003 6 NA
10 2.0030100e+13 39 0 1 2003 2003 4 NA
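I get a warning message, probably because the dates are formatted with "as.Date", but I really need that format, and the dates should also come through "aggregate" in the summarized result.
Thanks in advance.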
I think what you need is the opposite of what you tried. Try:
aggregate(TUCASEID~., df, mean)
# AGE MALE BLACK YEAR DATASET INTERVIEW_DAY INTERVIEW_DATE TUCASEID
#1 60 1 1 2003 2003 5 03Jan2003 2e+13
#2 41 0 0 2003 2003 6 04Jan2003 2e+13
Data
df <- structure(list(TUCASEID = c(2.00301e+13, 2.00301e+13, 2.00301e+13,
2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13,
2.00301e+13, 2.00301e+13), AGE = c(60L, 60L, 60L, 60L, 60L, 60L,
60L, 60L, 60L, 41L), MALE = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 0L), BLACK = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L), YEAR = c(2003L,
2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L
), DATASET = c(2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L,
2003L, 2003L, 2003L), INTERVIEW_DAY = c(5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 6L), INTERVIEW_DATE = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L), .Label = c("03Jan2003", "04Jan2003"), class =
"factor")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
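A minimal sketch of why the NA column appears and one way around it, using toy data rather than the real survey file. With a factor column like the one in `df` above, `mean()` cannot average the values and returns NA with a warning. Putting the date on the grouping side of the formula keeps it as a key instead. The conversion to Date with `format = "%d%b%Y"` is an assumption about the date layout and needs an English locale for the month abbreviation.

```r
# Toy data mirroring the structure in the post (values are illustrative).
df_toy <- data.frame(
  TUCASEID = c(1, 1, 1, 2, 2),
  AGE = c(60L, 60L, 60L, 41L, 41L),
  INTERVIEW_DATE = factor(c("03Jan2003", "03Jan2003", "03Jan2003",
                            "04Jan2003", "04Jan2003"))
)

# mean() of a factor is NA (with a warning); this is the source of the NA column.
na_result <- suppressWarnings(mean(df_toy$INTERVIEW_DATE))

# Group by both the id and the date, so the date is a key, not a value to average.
res <- aggregate(AGE ~ TUCASEID + INTERVIEW_DATE, data = df_toy, FUN = mean)

# Optional: parse the labels into a real Date column.
# Assumes day-abbreviated month-year labels and an English locale for %b.
df_toy$DATE_PARSED <- as.Date(as.character(df_toy$INTERVIEW_DATE),
                              format = "%d%b%Y")
```

This only works cleanly when each id has a single interview date; an id with several dates would produce one row per id and date combination.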
I did it in two steps:
First, I summarized the dataset by the identifier variable TUCASEID, using the sum of each variable:
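# sum_col is not defined in this post; presumably a user-defined summing function.
timeuse_2003_sum <- aggregate(timeuse_2003[,c("CHILD_CARE_BASIC","CHILD_CARE_TEACH",
"CHILD_CARE_PLAY", # ... remaining columns omitted in the original post ...
"OTHER")],
by=list(timeuse_2003$TUCASEID),sum_col)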
timeuse_2003_sum$TUCASEID <- timeuse_2003_sum$Group.1
timeuse_2003_sum$Group.1 <- NULL
timeuse_2003_sum <- subset(timeuse_2003_sum, select=c(38,1:37))
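Second, I summarized the dataset using the mean of each variable. This time I included not only the identifier TUCASEID as a grouping variable, but also the date variable INTERVIEW_DATE: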
timeuse_2003_mean <- aggregate(timeuse_2003[,c("TUCASEID","AGE","MALE","BLACK","MARRIED"
# ... remaining columns omitted in the original post ...
)],
by=list(timeuse_2003$TUCASEID, timeuse_2003$INTERVIEW_DATE),mean)
timeuse_2003_mean$TUCASEID <- timeuse_2003_mean$Group.1
timeuse_2003_mean$INTERVIEW_DATE <- timeuse_2003_mean$Group.2
timeuse_2003_mean$Group.1 <- NULL
timeuse_2003_mean$Group.2 <- NULL
Finally, I merged the two summarized datasets by the identifier TUCASEID:
##################################################################
## Appending Summary Statistics to single dataset again ##
##################################################################
timeuse_2003_Summary <- merge(timeuse_2003_mean, timeuse_2003_sum, by = "TUCASEID", all.y = TRUE)
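The two-step approach can be sketched end to end on toy data. The data frame `timeuse_toy` and its values are made up for illustration, and base `sum` stands in for the `sum_col` helper, which is not shown in the post. Because the date is a grouping key in the mean step, it comes back with its Date class intact.

```r
# Toy data; values are illustrative, and dates are already of class Date.
timeuse_toy <- data.frame(
  TUCASEID = c(1, 1, 2, 2),
  INTERVIEW_DATE = as.Date(c("2003-01-03", "2003-01-03",
                             "2003-01-04", "2003-01-04")),
  AGE = c(60, 60, 41, 41),
  CHILD_CARE_BASIC = c(10, 20, 0, 5)
)

# Step 1: sums of the activity variables, grouped by id only.
sums <- aggregate(CHILD_CARE_BASIC ~ TUCASEID, data = timeuse_toy, FUN = sum)

# Step 2: means of the person-level variables, grouped by id and date.
means <- aggregate(AGE ~ TUCASEID + INTERVIEW_DATE, data = timeuse_toy, FUN = mean)

# Step 3: merge the two summaries on the identifier.
summary_toy <- merge(means, sums, by = "TUCASEID", all.y = TRUE)
```

The formula interface names the grouping columns directly, so the `Group.1` and `Group.2` renaming steps are not needed in this variant.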