Make count data frame

I am working with count data, i.e. each date + time combination represents one data point. My current data frame looks like this:

  DATE        TIME
1 2014-02-15  15:02
2 2014-02-15  15:12
3 2014-04-15  02:02
4 2014-05-15  11:02
5 2014-06-15  15:42
6 2014-06-15  16:02
....

Now I want a new data frame that counts how many data points there are per hour on each date, like this:

  DATE        HOUR    COUNT
1 2014-02-15  15      2
2 2014-04-15  02      1
3 2014-05-15  11      1
4 2014-06-15  15      1
5 2014-06-15  16      1
....

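I want this in order to make a box plot with x = hour of the day and y = number of data points (over one year). I tried doing it with nested for loops, but that did not work.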


Edit: If possible, date/hour combinations with no data points should also appear in the data frame, with COUNT = 0.

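You can do this in several ways, but I suspect the simplest is to use table. With table you can return the frequency of each date, which is basically just a count of the dates in the data frame.

You can do the same after extracting the hour, and you can even cross-tabulate the two with table(DF$DATE, DF$HOUR). Wrapping the result in as.data.frame gives you a listing close to what you are looking for.

Edited to add: In response to your edit to the question, you can use factor levels to get the zero counts out of the table call. table respects your factor levels by including every level in the output, even when a level does not occur in the input (in fact, I believe table coerces its input to a factor behind the scenes).

Example code: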

# Set options and load example data
options(stringsAsFactors = FALSE)
date.data <- data.frame(DATE = c("2014-02-15","2014-02-15","2014-04-15","2014-05-15","2014-06-15","2014-06-15"),
                        TIME = c("15:02","15:12","02:02","11:02","15:42","16:02"))

# Extract the hour
date.data$HOUR <- sapply(X = strsplit(x = date.data$TIME,split = ":"),FUN = `[[`,1)

# Now, set the hours as a factor level - this will allow table() to fill the data in as you are requesting
date.data$HOUR <- factor(x = date.data$HOUR,
                         levels = sprintf("%02d", 0:23))  # "00", "01", ..., "23"

# Obtain the first table of interest
as.data.frame(table(date.data$DATE))

        Var1 Freq
1 2014-02-15    2
2 2014-04-15    1
3 2014-05-15    1
4 2014-06-15    2

# And the second table
as.data.frame(table(date.data$DATE,date.data$HOUR))

         Var1 Var2 Freq
1  2014-02-15   00    0
2  2014-04-15   00    0
3  2014-05-15   00    0
4  2014-06-15   00    0
5  2014-02-15   01    0
6  2014-04-15   01    0
7  2014-05-15   01    0
8  2014-06-15   01    0
....
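
As a follow-up sketch (not part of the original answer), the same table() call can produce the column names used in the question, and the result can be passed straight to the box plot the question asks about. It relies on two base R features: naming the arguments of table(), and the responseName argument of as.data.frame(). The variable name counts is only illustrative.

```r
# Example data from the question
date.data <- data.frame(
  DATE = c("2014-02-15", "2014-02-15", "2014-04-15",
           "2014-05-15", "2014-06-15", "2014-06-15"),
  TIME = c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02"),
  stringsAsFactors = FALSE
)

# Hour as a factor with all 24 levels, so hours without data are counted as 0
date.data$HOUR <- factor(substr(date.data$TIME, 1, 2),
                         levels = sprintf("%02d", 0:23))

# Named arguments become the column names; responseName renames "Freq"
counts <- as.data.frame(
  table(DATE = date.data$DATE, HOUR = date.data$HOUR),
  responseName = "COUNT"
)

# One box per hour, each summarising the per-date counts for that hour
boxplot(COUNT ~ HOUR, data = counts,
        xlab = "Hour of day", ylab = "Data points per date")
```

Note that this only covers dates that occur at least once in the data. To include dates with no observations at all, DATE would also have to be a factor whose levels list every date of interest.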

Is this what you are looking for?

options(stringsAsFactors = FALSE)

data = read.table(text  = 
"                  1 2014-02-15  15:02
                   2 2014-02-15  15:12
                   3 2014-04-15  02:02
                   4 2014-05-15  11:02
                   5 2014-06-15  15:42
                   6 2014-06-15  16:02")


colnames(data) = c("index", "date", "time")

table(data$date)

 # 2014-02-15 2014-04-15 2014-05-15 2014-06-15 
 #     2          1          1          2 

table(data$date, data$time)

fz = table(data$date, substr(data$time, 1,2))
print(fz)   

 #            02 11 15 16
 # 2014-02-15  0  0  2  0
 # 2014-04-15  1  0  0  0
 # 2014-05-15  0  1  0  0
 # 2014-06-15  0  0  1  1

If you want to reshape the data into long format, you can do the following:

library(reshape)

otherFormat = melt(fz)
colnames(otherFormat) = c("date","hour", "frequency")

print(otherFormat)

#          date hour frequency
# 1  2014-02-15    2         0
# 2  2014-04-15    2         1
# 3  2014-05-15    2         0
# 4  2014-06-15    2         0
# 5  2014-02-15   11         0
# 6  2014-04-15   11         0
# 7  2014-05-15   11         1
# 8  2014-06-15   11         0
# 9  2014-02-15   15         2
# 10 2014-04-15   15         0
# 11 2014-05-15   15         0
# 12 2014-06-15   15         1
# 13 2014-02-15   16         0
# 14 2014-04-15   16         0
# 15 2014-05-15   16         0
# 16 2014-06-15   16         1

IMO, the most readable way:

Edited to answer your updated question.

library(dplyr)
library(stringr)

df <- date.data %>%
  group_by(
    DATE = as.Date(DATE), 
    HOUR = as.numeric(str_sub(TIME, 1, 2))
    ) %>%
  tally 

# create a data frame with all dates/hours
expand.grid(
  # include all dates from first to last
  DATE = seq.Date(min(df$DATE), max(df$DATE), "day"),
  HOUR = 0:23
) %>% 
  arrange(DATE) %>%
  left_join(df, by = c("DATE", "HOUR"))
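
The left_join() above leaves n as NA for every date/hour combination that has no data, whereas the question asks for a count of 0. A minimal sketch of one way to fill those in, assuming dplyr's coalesce() is acceptable; the names counts and full are illustrative, and the example data is repeated so that the block runs on its own:

```r
library(dplyr)

# Example data from the question
date.data <- data.frame(
  DATE = c("2014-02-15", "2014-02-15", "2014-04-15",
           "2014-05-15", "2014-06-15", "2014-06-15"),
  TIME = c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02"),
  stringsAsFactors = FALSE
)

# Count rows per date and hour; count() names the result column n
counts <- date.data %>%
  count(DATE = as.Date(DATE), HOUR = as.integer(substr(TIME, 1, 2)))

# All date/hour combinations between the first and last date
full <- expand.grid(
  DATE = seq(min(counts$DATE), max(counts$DATE), by = "day"),
  HOUR = 0:23
) %>%
  left_join(counts, by = c("DATE", "HOUR")) %>%
  mutate(n = coalesce(n, 0L)) %>%   # combinations without data get 0
  arrange(DATE, HOUR)

head(full)
```

Because HOUR is kept as an integer here, it is 2 rather than "02"; format it with sprintf("%02d", HOUR) if the zero-padded label is needed.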

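Here are some additional options. First, you create a column for the hour in mutate(). Then you count how many data points exist for each DATE and hour in count(). After ungrouping the data, you join two data frames to create the desired result. The expand.grid() part creates all combinations of DATE and hour (00 to 23). Because the hour is stored as 02 rather than 2, I used c(paste0("0", 0:9), 10:23). Finally, you replace NA with 0 in the last mutate().
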
library(dplyr)
library(stringi)
library(data.table)

# Note the doubled backslash: "\d" on its own is not a valid escape in an R string
mutate(mydf, DATE, hour = stri_extract_first(TIME, regex = "\\d+")) %>%
count(DATE, hour) %>%
ungroup %>%
right_join(expand.grid(DATE = unique(.$DATE),
                       hour = c(paste0("0", 0:9), 10:23))) %>%
mutate(n = replace(n, is.na(n), 0))

# A bit of outcome
#         DATE hour n
#1  2014-02-15   00 0
#2  2014-04-15   00 0
#3  2014-05-15   00 0
#4  2014-06-15   00 0
#5  2014-02-15   01 0

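With data.table, you can do the same operation. You create a column for hour and count the number of data points by DATE and hour. Then you merge temp with a data table that contains all combinations of DATE and hour (00 to 23), which you can create with CJ(). Once the merge is done, you replace NA with 0 in the count column (total).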

setDT(mydf)[, hour := stri_extract_first(TIME, regex = "\\d+")][,
            list(total = .N), by = list(DATE, hour)] -> temp

merge(temp,
      CJ(DATE = unique(mydf$DATE), hour = c(paste0("0", 0:9), 10:23)),
      by = c("DATE", "hour"), all = TRUE)[, total := replace(total, is.na(total), 0)][]

#          DATE hour total
# 1: 2014-02-15   02     0
# 2: 2014-02-15   11     0
# 3: 2014-02-15   15     2
# 4: 2014-02-15   16     0
# 5: 2014-02-15   00     0

Data

mydf <- structure(list(DATE = structure(c(16116, 16116, 16175, 16205, 
16236, 16236), class = "Date"), TIME = structure(c(3L, 4L, 1L, 
2L, 5L, 6L), .Label = c("02:02", "11:02", "15:02", "15:12", "15:42", 
"16:02"), class = "factor")), class = "data.frame", .Names = c("DATE", 
"TIME"), row.names = c(NA, -6L))