Aggregate in R with a date column, but by the identifier column

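I want to aggregate (i.e. summarize) my data by an id variable. However, after doing so the date column contains only NA, which I think is because it is set to "Date".

I want to keep the date.

Data (first 10 observations):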

          TUCASEID AGE MALE BLACK YEAR DATASET INTERVIEW_DAY INTERVIEW_DATE
1   2.00301e+13  60    1     1 2003    2003             5      03Jan2003
2   2.00301e+13  60    1     1 2003    2003             5      03Jan2003
3   2.00301e+13  60    1     1 2003    2003             5      03Jan2003
4   2.00301e+13  60    1     1 2003    2003             5      03Jan2003
5   2.00301e+13  60    1     1 2003    2003             5      03Jan2003
6   2.00301e+13  60    1     1 2003    2003             5      03Jan2003
7   2.00301e+13  60    1     1 2003    2003             5      03Jan2003
8   2.00301e+13  60    1     1 2003    2003             5      03Jan2003
9   2.00301e+13  60    1     1 2003    2003             5      03Jan2003
10  2.00301e+13  41    0     0 2003    2003             6      04Jan2003

Then I summarize it with aggregate:

timeuse_2003_mean <- aggregate(timeuse_2003[,c("AGE","MALE","BLACK","YEAR","DATASET","INTERVIEW_DAY","INTERVIEW_DATE")],
      by=list(timeuse_2003$TUCASEID),mean)

Here is the output:

  TUCASEID         AGE MALE BLACK YEAR DATASET INTERVIEW_DAY INTERVIEW_DATE
1   2.0030100e+13  60    1     1 2003    2003             5             NA
2   2.0030100e+13  41    0     0 2003    2003             6             NA
3   2.0030100e+13  26    0     0 2003    2003             6             NA
4   2.0030100e+13  36    0     1 2003    2003             4             NA
5   2.0030100e+13  51    1     0 2003    2003             4             NA
6   2.0030100e+13  32    0     0 2003    2003             4             NA
7   2.0030100e+13  44    0     0 2003    2003             1             NA
8   2.0030100e+13  21    0     0 2003    2003             2             NA
9   2.0030100e+13  33    0     0 2003    2003             6             NA
10  2.0030100e+13  39    0     1 2003    2003             4             NA

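I get a warning message, probably because the date was formatted with "as.Date". I do need that format, though, and I want the dates to be "summarized" by "aggregate" as well.

Thanks in advance.

I think you need the opposite of what you tried. Try: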

aggregate(TUCASEID~., df, mean)

#  AGE MALE BLACK YEAR DATASET INTERVIEW_DAY INTERVIEW_DATE TUCASEID
#1  60    1     1 2003    2003             5      03Jan2003    2e+13
#2  41    0     0 2003    2003             6      04Jan2003    2e+13
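
To see why the original call produced NA while the formula version keeps the date, here is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical). It assumes the date column is stored as a factor of strings, as in the data used for this answer.

```r
# Toy data mirroring the question: the date is a factor of strings (assumption).
toy <- data.frame(
  TUCASEID = c(1, 1, 2),
  AGE = c(60L, 60L, 41L),
  INTERVIEW_DATE = factor(c("03Jan2003", "03Jan2003", "04Jan2003"))
)

# mean() is not defined for a factor: it returns NA and emits a warning,
# which is what happens to the date column in the original aggregate() call.
m <- suppressWarnings(mean(toy$INTERVIEW_DATE))
is.na(m)

# In the formula interface, every variable on the right-hand side is a
# grouping variable, so the date labels each group instead of being averaged.
by_group <- aggregate(TUCASEID ~ ., toy, mean)
by_group
```

Note that `TUCASEID ~ .` groups by every other column, so rows are only collapsed when they agree on all of those columns.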

Data

df <- structure(list(TUCASEID = c(2.00301e+13, 2.00301e+13, 2.00301e+13, 
2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13, 
2.00301e+13, 2.00301e+13), AGE = c(60L, 60L, 60L, 60L, 60L, 60L, 
60L, 60L, 60L, 41L), MALE = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 0L), BLACK = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L), YEAR = c(2003L, 
2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L
), DATASET = c(2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 
2003L, 2003L, 2003L), INTERVIEW_DAY = c(5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 6L), INTERVIEW_DATE = structure(c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L), .Label = c("03Jan2003", "04Jan2003"), class = 
"factor")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
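
If a real Date column is wanted in the result, one option is to convert the strings first and then use the date as a second grouping variable. The following is only a sketch on made-up data (`df2` and its values are hypothetical). It assumes the dates look like "03Jan2003" and that the session uses English month abbreviations, since `%b` is locale-dependent.

```r
# Hypothetical toy data; the date starts out as plain strings.
df2 <- data.frame(
  TUCASEID = c(1, 1, 2),
  AGE = c(60L, 60L, 41L),
  MALE = c(1L, 1L, 0L),
  INTERVIEW_DATE = c("03Jan2003", "03Jan2003", "04Jan2003")
)

# Convert to Date. %b matches the abbreviated month name in the current
# locale, so this assumes an English (or C) locale.
df2$INTERVIEW_DATE <- as.Date(df2$INTERVIEW_DATE, format = "%d%b%Y")

# Average the numeric columns; id and date both act as grouping variables,
# so the date is carried along rather than averaged.
res <- aggregate(cbind(AGE, MALE) ~ TUCASEID + INTERVIEW_DATE,
                 data = df2, FUN = mean)
res
```

This relies on each id having a single interview date. An id with several dates would produce one row per id and date combination.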

I solved it in two steps:

First, I summarized the dataset by the identifier variable TUCASEID, using the sum of each variable:

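# The column list is abbreviated in the original post; "..." marks the omitted names.
# sum_col is the poster's own summing function, which is not shown here.
timeuse_2003_sum <- aggregate(timeuse_2003[,c("CHILD_CARE_BASIC","CHILD_CARE_TEACH",
                                              "CHILD_CARE_PLAY", ..., "OTHER")],
                              by=list(timeuse_2003$TUCASEID),sum_col)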

timeuse_2003_sum$TUCASEID <- timeuse_2003_sum$Group.1

timeuse_2003_sum$Group.1 <- NULL

timeuse_2003_sum <- subset(timeuse_2003_sum, select=c(38,1:37))

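Second, I summarized the dataset using the mean of each variable. This time I included not only the identifier TUCASEID as a grouping variable, but also the date variable INTERVIEW_DATE: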

# The column list is cut off in the original post; "..." marks the omitted names.
timeuse_2003_mean <- aggregate(timeuse_2003[,c("TUCASEID","AGE","MALE","BLACK","MARRIED", ...)],
                               by=list(timeuse_2003$TUCASEID, timeuse_2003$INTERVIEW_DATE),mean)

timeuse_2003_mean$TUCASEID <- timeuse_2003_mean$Group.1

timeuse_2003_mean$INTERVIEW_DATE <- timeuse_2003_mean$Group.2

timeuse_2003_mean$Group.1 <- NULL

timeuse_2003_mean$Group.2 <- NULL

Finally, I merged the two summary datasets by the identifier TUCASEID:

##################################################################
##     Appending Summary Statistics to single dataset again     ##
##################################################################

timeuse_2003_Summary <- merge(timeuse_2003_mean, timeuse_2003_sum, by = "TUCASEID", all.y = TRUE)
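
Because the column lists above are abbreviated, here is a small self-contained sketch of the same two-step workflow on made-up data. The `toy` data frame and its values are hypothetical, base `sum` stands in for the `sum_col` function (which is not shown in the post), and naming the elements of `by` is a shortcut that avoids renaming `Group.1` and `Group.2` afterwards.

```r
# Hypothetical toy data: two respondents, two activity rows each.
toy <- data.frame(
  TUCASEID = c(1, 1, 2, 2),
  INTERVIEW_DATE = as.Date(c("2003-01-03", "2003-01-03",
                             "2003-01-04", "2003-01-04")),
  AGE = c(60, 60, 41, 41),
  CHILD_CARE_BASIC = c(10, 20, 0, 5),
  OTHER = c(1, 2, 3, 4)
)

# Step 1: sums per respondent (base sum used in place of sum_col).
toy_sum <- aggregate(toy[, c("CHILD_CARE_BASIC", "OTHER")],
                     by = list(TUCASEID = toy$TUCASEID), FUN = sum)

# Step 2: means per respondent and date; the date is a grouping variable,
# so it is kept as-is rather than averaged.
toy_mean <- aggregate(toy[, "AGE", drop = FALSE],
                      by = list(TUCASEID = toy$TUCASEID,
                                INTERVIEW_DATE = toy$INTERVIEW_DATE),
                      FUN = mean)

# Step 3: combine both summaries on the identifier.
toy_summary <- merge(toy_mean, toy_sum, by = "TUCASEID", all.y = TRUE)
toy_summary
```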