Make count data frame
I have count data, where each date and time combination represents one data point. My current data frame looks like this:
DATE TIME
1 2014-02-15 15:02
2 2014-02-15 15:12
3 2014-04-15 02:02
4 2014-05-15 11:02
5 2014-06-15 15:42
6 2014-06-15 16:02
....
Now I want a new data frame that counts how many data points there are per hour on each date, like this:
DATE HOUR COUNT
1 2014-02-15 15 2
2 2014-04-15 02 1
3 2014-05-15 11 1
4 2014-06-15 15 1
5 2014-06-15 16 1
....
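I want this in order to make a box plot with x = hour of the day and y = number of data points (over one year). I tried to do it with nested for loops, but that did not work.
Edit: if possible, date/hour combinations without any data points should also appear in the data frame, with COUNT = 0.
You can do this in several ways, but I suspect the easiest is to use table(). With table() you can return the frequency of each date, which is basically just a count of the dates in the data frame.
You can do the same after extracting the hour, and you can even cross-tabulate both by calling table(DF$DATE, DF$HOUR). Wrapping the result in as.data.frame() gives you something close to the data frame you are looking for.
Edited to add: in response to the edit to your question, you can use factor levels to get the zero counts out of table(). table() respects factor levels by including them in the output even when they do not occur in the input (in fact, I believe table() coerces its input to a factor behind the scenes).
Example code: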
# Set options and load example data
options(stringsAsFactors = FALSE)
date.data <- data.frame(DATE = c("2014-02-15","2014-02-15","2014-04-15","2014-05-15","2014-06-15","2014-06-15"),
TIME = c("15:02","15:12","02:02","11:02","15:42","16:02"))
# Extract the hour
date.data$HOUR <- sapply(X = strsplit(x = date.data$TIME,split = ":"),FUN = `[[`,1)
# Now, set the hours as a factor level - this will allow table() to fill the data in as you are requesting
date.data$HOUR <- factor(x = date.data$HOUR,
levels = c("00","01","02","03","04","05",
"06","07","08","09","10","11",
"12","13","14","15","16","17",
"18","19","20","21","22","23"),
labels = c("00","01","02","03","04","05",
"06","07","08","09","10","11",
"12","13","14","15","16","17",
"18","19","20","21","22","23"))
# Obtain the first table of interest
as.data.frame(table(date.data$DATE))
Var1 Freq
1 2014-02-15 2
2 2014-04-15 1
3 2014-05-15 1
4 2014-06-15 2
# And the second table
as.data.frame(table(date.data$DATE,date.data$HOUR))
Var1 Var2 Freq
1 2014-02-15 00 0
2 2014-04-15 00 0
3 2014-05-15 00 0
4 2014-06-15 00 0
5 2014-02-15 01 0
6 2014-04-15 01 0
7 2014-05-15 01 0
8 2014-06-15 01 0
....
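As a compact variant of the code above, the following is a minimal sketch that builds the 24 hour levels with sprintf() and names the output columns directly. The object names obs and counts are illustrative and not part of the original answer.

```r
# Example data from the question
obs <- data.frame(
  DATE = c("2014-02-15", "2014-02-15", "2014-04-15",
           "2014-05-15", "2014-06-15", "2014-06-15"),
  TIME = c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02"),
  stringsAsFactors = FALSE
)

# Hour as a factor with all 24 levels, so table() reports empty hours as 0
obs$HOUR <- factor(substr(obs$TIME, 1, 2), levels = sprintf("%02d", 0:23))

# Named arguments to table() become the column names of the data frame;
# responseName renames the default "Freq" column
counts <- as.data.frame(table(DATE = obs$DATE, HOUR = obs$HOUR),
                        responseName = "COUNT")

nrow(counts)        # 4 dates x 24 hours = 96 rows
sum(counts$COUNT)   # 6, one per original observation
```

Note that this only covers dates that occur in the data. Dates with no observations at all would need DATE to be a factor with the full set of dates as levels.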
Is this what you are looking for?
options(stringsAsFactors = F)
data = read.table(text =
" 1 2014-02-15 15:02
2 2014-02-15 15:12
3 2014-04-15 02:02
4 2014-05-15 11:02
5 2014-06-15 15:42
6 2014-06-15 16:02")
colnames(data) = c("index", "date", "time")
table(data$date)
# 2014-02-15 2014-04-15 2014-05-15 2014-06-15
# 2 1 1 2
table(data$date, data$time)
fz = table(data$date, substr(data$time, 1,2))
print(fz)
# 02 11 15 16
# 2014-02-15 0 0 2 0
# 2014-04-15 1 0 0 0
# 2014-05-15 0 1 0 0
# 2014-06-15 0 0 1 1
If you want to reshape the data, you can do the following:
library(reshape)
otherFormat = melt(fz)
colnames(otherFormat) = c("date","hour", "frequency")
print(otherFormat)
# date hour frequency
# 1 2014-02-15 2 0
# 2 2014-04-15 2 1
# 3 2014-05-15 2 0
# 4 2014-06-15 2 0
# 5 2014-02-15 11 0
# 6 2014-04-15 11 0
# 7 2014-05-15 11 1
# 8 2014-06-15 11 0
# 9 2014-02-15 15 2
# 10 2014-04-15 15 0
# 11 2014-05-15 15 0
# 12 2014-06-15 15 1
# 13 2014-02-15 16 0
# 14 2014-04-15 16 0
# 15 2014-05-15 16 0
# 16 2014-06-15 16 1
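In the melted output above the hour appears as 2 rather than 02. If the zero-padded label matters, one alternative is base R's as.data.frame() method for tables, which also returns the long format. This is a sketch under that assumption; the names dates, times and long are illustrative.

```r
dates <- c("2014-02-15", "2014-02-15", "2014-04-15",
           "2014-05-15", "2014-06-15", "2014-06-15")
times <- c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02")

# Cross-tabulate date against the first two characters of the time
fz <- table(date = dates, hour = substr(times, 1, 2))

# Long format: one row per date/hour cell, labels kept as character strings
long <- as.data.frame(fz, responseName = "frequency",
                      stringsAsFactors = FALSE)

# Show only the cells that actually contain observations
long[long$frequency > 0, ]
```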
IMO, the most readable way:
Edited to answer your updated question
library(dplyr)
library(stringr)
df <- date.data %>%
group_by(
DATE = as.Date(DATE),
HOUR = as.numeric(str_sub(TIME, 1, 2))
) %>%
tally
# create a data frame with all dates/hours
expand.grid(
# include all dates from first to last
DATE = seq.Date(min(df$DATE), max(df$DATE), "day"),
HOUR = 0:23
) %>%
arrange(DATE) %>%
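left_join(df, by = c("DATE", "HOUR")) %>%
# left_join() leaves n as NA for date/hour pairs without data; set those to 0
mutate(n = coalesce(n, 0L))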
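The same idea can be sketched with tidyr::complete(), which fills in the missing date/hour combinations in one step. This is an alternative, not the answer's method; it assumes the tidyr package is installed, and the names obs, hourly and COUNT are illustrative. The final line draws the box plot the question asks for.

```r
library(dplyr)
library(tidyr)

obs <- data.frame(
  DATE = as.Date(c("2014-02-15", "2014-02-15", "2014-04-15",
                   "2014-05-15", "2014-06-15", "2014-06-15")),
  TIME = c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02"),
  stringsAsFactors = FALSE
)

hourly <- obs %>%
  mutate(HOUR = as.integer(substr(TIME, 1, 2))) %>%
  count(DATE, HOUR, name = "COUNT") %>%
  # add every day between the first and last date, and every hour 0..23;
  # combinations that had no rows get COUNT = 0
  complete(DATE = seq(min(DATE), max(DATE), by = "day"),
           HOUR = 0:23,
           fill = list(COUNT = 0L))

# One box per hour of the day, built from the per-day counts
boxplot(COUNT ~ HOUR, data = hourly,
        xlab = "Hour of day", ylab = "Data points per day")
```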
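Here is an additional option. First, you create a column for the hour in mutate(). Then you count how many data points exist for each DATE and hour with count(). After ungrouping the data, you join two data frames to create the desired result. The expand.grid() part creates all combinations of dates and hours (00 to 23). Because you have 02 for 2, I used c(paste0("0", 0:9), 10:23). Finally, the last mutate() replaces NA with 0.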
library(dplyr)
library(stringi)
library(data.table)
mutate(mydf, DATE, hour = stri_extract_first(TIME, regex = "\\d+")) %>%
count(DATE, hour) %>%
ungroup %>%
right_join(expand.grid(DATE = unique(.$DATE),
hour = c(paste0("0", 0:9), 10:23))) %>%
mutate(n = replace(n, is.na(n), 0))
# A bit of outcome
# DATE hour n
#1 2014-02-15 00 0
#2 2014-04-15 00 0
#3 2014-05-15 00 0
#4 2014-06-15 00 0
#5 2014-02-15 01 0
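The zero padding matters because the join compares hour labels as strings. A small sketch of the point, with sprintf() shown as an equivalent way to build the labels (the variable names are illustrative):

```r
# The construction used in the answer: "00".."09" pasted, then 10..23 coerced
hours_paste <- c(paste0("0", 0:9), 10:23)

# Equivalent construction with a zero-padded, width-2 format
hours_sprintf <- sprintf("%02d", 0:23)

identical(hours_paste, hours_sprintf)  # TRUE

# An unpadded label would not match the hours extracted from TIME
"2" %in% hours_sprintf    # FALSE
"02" %in% hours_sprintf   # TRUE
```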
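You can do the same operation with data.table. You create a column for hour and count the number of data points by DATE and hour, storing the result in temp. Then you merge temp with a data.table containing all combinations of DATE and hour (00 to 23), which you can create with CJ(). After the merge, replace NA with 0 in the count column (total).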
setDT(mydf)[, hour := stri_extract_first(TIME, regex = "\\d+")][,
list(total = .N), by = list(DATE, hour)] -> temp
merge(temp,
CJ(DATE = unique(mydf$DATE), hour = c(paste0("0", 0:9), 10:23)),
by = c("DATE", "hour"), all = TRUE)[, total := replace(total, is.na(total), 0)][]
# DATE hour total
# 1: 2014-02-15 02 0
# 2: 2014-02-15 11 0
# 3: 2014-02-15 15 2
# 4: 2014-02-15 16 0
# 5: 2014-02-15 00 0
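A variant of the same steps, sketched with a join on the CJ() grid instead of merge(). It uses substr() rather than stringi to keep the example dependency-free; the names dt, counts, grid and result are illustrative.

```r
library(data.table)

dt <- data.table(
  DATE = as.Date(c("2014-02-15", "2014-02-15", "2014-04-15",
                   "2014-05-15", "2014-06-15", "2014-06-15")),
  TIME = c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02")
)

# Hour as a zero-padded string, taken from the first two characters of TIME
dt[, hour := substr(TIME, 1, 2)]

# Count observations per date and hour
counts <- dt[, .(total = .N), by = .(DATE, hour)]

# All combinations of the observed dates and the 24 hours
grid <- CJ(DATE = unique(dt$DATE), hour = sprintf("%02d", 0:23))

# Look up each grid row in counts; unmatched rows get total = NA
result <- counts[grid, on = .(DATE, hour)]

# Replace the NA counts with 0 by reference
result[is.na(total), total := 0L]
```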
Data
mydf <- structure(list(DATE = structure(c(16116, 16116, 16175, 16205,
16236, 16236), class = "Date"), TIME = structure(c(3L, 4L, 1L,
2L, 5L, 6L), .Label = c("02:02", "11:02", "15:02", "15:12", "15:42",
"16:02"), class = "factor")), class = "data.frame", .Names = c("DATE",
"TIME"), row.names = c(NA, -6L))