Subset a data frame based on a time sequence
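I have a data frame called DF that contains time and date columns, and I want to subset DF based on the values in those columns. For the dates, I have a list of dates in DATES, and I am keeping the rows of DF where DF$Date is present in DATES. For the times, I want to keep the rows from 00:04:00 to 00:06:00. I don't know how to do the latter.
Ideally I would like to subset both by specifying a range, such as 00:04:00 to 00:06:00, and by specifying a starting point and a number of minutes, such as a start time and 3 minutes (two different ways of doing it). I guess it all comes down to building a time sequence and putting that sequence in a separate vector that I can then use for matching.
Note that this is just a short reproducible example. I am looking for a general way to do this, because in practice I want to subset over large time spans. Also note that although there is only one matching date in this example, in reality there will be many matching dates spanning many years. That is why I don't think it is possible to use POSIXlt to build the time sequence. Many thanks.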
#DF looks like this:
# DateTime XXX Time Date
#1371 2016-04-25 00:08:00 14 00:08:00 2016-04-25
#1372 2016-04-25 00:07:00 13 00:07:00 2016-04-25
#1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
#1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
#1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25
#1376 2016-04-25 00:03:00 4 00:03:00 2016-04-25
#1377 2016-04-25 00:02:00 6 00:02:00 2016-04-25
#1387 2016-04-24 23:52:00 41 23:52:00 2016-04-24
#1388 2016-04-24 23:51:00 93 23:51:00 2016-04-24
#1389 2016-04-24 23:50:00 53 23:50:00 2016-04-24
#Code for DF, DATES, and to subset DF based on DATES
DF <- structure(list(DateTime = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L, 0L, 59L, 58L, 57L, 56L, 55L, 54L, 53L, 52L, 51L, 50L), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L), mday = c(25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L), mon = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L), wday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), yday = c(115L, 115L, 115L, 115L, 115L, 115L, 115L, 115L, 115L, 114L, 114L, 114L, 114L, 114L, 114L, 114L, 114L, 114L, 114L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), zone = c("EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT"), gmtoff = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec", "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), Open = c(14, 13, 14, 3, 2, 4, 6, 4, 15, 15, 23, 24, 33, 14, 65, 54, 41, 93, 53), Time = c("00:08:00", "00:07:00", "00:06:00", "00:05:00", "00:04:00", "00:03:00", "00:02:00", "00:01:00", "00:00:00", "23:59:00", "23:58:00", "23:57:00", "23:56:00", "23:55:00", "23:54:00", "23:53:00", "23:52:00", "23:51:00", "23:50:00"), Date = structure(c(16916, 16916, 16916, 16916, 16916, 16916, 16916, 16916, 16916, 16915, 16915, 16915, 16915, 16915, 16915, 16915, 16915, 16915, 16915), class = "Date")), .Names = c("DateTime", "XXX", "Time", "Date"), row.names = 
c("1371", "1372", "1373", "1374", "1375", "1376", "1377", "1378", "1379", "1380", "1381", "1382", "1383", "1384", "1385", "1386", "1387", "1388", "1389"), class = "data.frame")
DATES <- structure(c(12431, 12432, 10445, 10480, 11487, 12494, 12501, 12508, 13115, 13522, 14529, 15536, 16916, 16935), class = "Date")
SELEC <- DF[DF$Date %in% DATES,]
#Result of subsetting by Date:
# DateTime XXX Time Date
# 1371 2016-04-25 00:08:00 14 00:08:00 2016-04-25
# 1372 2016-04-25 00:07:00 13 00:07:00 2016-04-25
# 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
# 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
# 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25
# 1376 2016-04-25 00:03:00 4 00:03:00 2016-04-25
# 1377 2016-04-25 00:02:00 6 00:02:00 2016-04-25
# 1378 2016-04-25 00:01:00 4 00:01:00 2016-04-25
# 1379 2016-04-25 00:00:00 15 00:00:00 2016-04-25
#What the final result would look like with a larger data set spanning many years:
# DateTime XXX Time Date
#2016-04-25 00:06:00 13 00:06:00 2016-04-25
#2016-04-25 00:05:00 14 00:05:00 2016-04-25
#2016-04-25 00:04:00 3 00:04:00 2016-04-25
#2014-03-11 00:06:00 94 00:06:00 2014-03-11
#2014-03-11 00:05:00 6 00:05:00 2014-03-11
#2014-03-11 00:04:00 14 00:04:00 2014-03-11
#2011-08-06 00:06:00 13 00:06:00 2011-08-06
#2011-08-06 00:05:00 19 00:05:00 2011-08-06
#2011-08-06 00:04:00 41 00:04:00 2011-08-06
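How about this?
# strptime() with a time-only format fills in the current date, so Time becomes a full
# date-time (the 2016-05-14 in the output below is the day the code was run).
DF$Time <- strptime(DF$Time,format = '%H:%M:%S')
timeCondition <- (DF$Time >= strptime('00:04:00',format = '%H:%M:%S')) & (DF$Time <= strptime('00:06:00',format = '%H:%M:%S'))
SELEC <- DF[timeCondition & DF$Date %in% DATES,]
This gives: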
DateTime XXX Time Date
1373 2016-04-25 00:06:00 14 2016-05-14 00:06:00 2016-04-25
1374 2016-04-25 00:05:00 3 2016-05-14 00:05:00 2016-04-25
1375 2016-04-25 00:04:00 2 2016-05-14 00:04:00 2016-04-25
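The snippet above overwrites DF$Time, which is why the Time column now carries the run date. If the original strings should be kept, one option is to hold the parsed times in a separate vector. The following is only a sketch on a made-up toy data frame (toy, tm, lo and hi are illustrative names, and the time zone is fixed to UTC as an assumption to keep the example deterministic):

```r
# Toy data standing in for DF; only the Time column matters here.
toy <- data.frame(
  Time = c("00:03:00", "00:04:00", "00:06:00", "00:07:00"),
  stringsAsFactors = FALSE
)

# Parse into a separate POSIXct vector instead of replacing toy$Time.
# Every value gets the same (current) date, so only the clock time differs.
parse_time <- function(x) as.POSIXct(strptime(x, format = "%H:%M:%S", tz = "UTC"))
tm <- parse_time(toy$Time)
lo <- parse_time("00:04:00")
hi <- parse_time("00:06:00")

# Inclusive range; the Time column keeps its original strings.
SELEC <- toy[tm >= lo & tm <= hi, , drop = FALSE]
print(SELEC)
```

The date condition from the original (DF$Date %in% DATES) can be combined with this logical vector using & in the same way.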
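Possibility 1: Lexicographical comparison
If all time values are stored as zero-padded 24-hour strings with consistent separators, such as %H:%M:%S, then the filter can be applied using lexicographical comparison.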
DF[DF$Date%in%DATES & DF$Time>='00:04:00' & DF$Time<='00:06:00',];
## DateTime XXX Time Date
## 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25
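To see why the zero padding matters, here is a small sketch with made-up values (padded, unpadded and repaired are illustrative names). It compares two times as strings with and without padding, and shows one possible repair that assumes the only defect is a missing leading zero in the hour:

```r
# Zero-padded strings sort in the same order as the times they represent.
padded <- c("09:00:00", "10:00:00")
padded_ok <- padded[1] < padded[2]        # 9 o'clock sorts before 10 o'clock

# Without padding, "9" is compared with "1" character by character,
# so 9 o'clock wrongly sorts after 10 o'clock.
unpadded <- c("9:00:00", "10:00:00")
unpadded_ok <- unpadded[1] < unpadded[2]

# Possible repair, assuming only the hour can be missing its leading zero.
repaired <- ifelse(nchar(unpadded) == 7, paste0("0", unpadded), unpadded)
repaired_ok <- repaired[1] < repaired[2]

print(c(padded = padded_ok, unpadded = unpadded_ok, repaired = repaired_ok))
```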
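Lexicographical solutions are of course not ideal, because they do not lend themselves to time-based arithmetic such as addition, subtraction, multiplication, and division.
Better solutions convert the time values to a numeric type that encodes a duration as an offset from an explicit or unspecified base time. This is how popular date/time libraries encode their types, for example Boost date_time for C++, Joda-Time for Java, and POSIXct, difftime, and lubridate for R.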
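As a small illustration of the offset encoding in base R, the sketch below looks inside a POSIXct value. The variable names t0 and t1 are illustrative, and the time zone is fixed to UTC as an assumption so the numbers do not depend on the local settings:

```r
# A POSIXct value is a number of seconds since 1970-01-01 00:00:00 UTC.
t0 <- as.POSIXct("2016-04-25 00:04:00", tz = "UTC")
secs <- as.numeric(t0)
print(secs)

# Because the payload is numeric, arithmetic is straightforward:
t1 <- t0 + 3 * 60                          # three minutes later
gap <- difftime(t1, t0, units = "mins")    # the difference as a difftime
print(gap)
```

2016-04-25 is day 16916 since the epoch (the same number that appears in the Date column of the dput output in the question), so the value is 16916 * 86400 + 240 seconds.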
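Possibility 2: Manual numeric
You can parse the strings yourself to construct numbers that represent durations, and then apply the filter using numeric comparison.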
hmsToDouble <- function(hms) as.double(substr(hms,1,2))*3600 + as.double(substr(hms,4,5))*60 + as.double(substr(hms,7,8));
DF[DF$Date%in%DATES & hmsToDouble(DF$Time)>=hmsToDouble('00:04:00') & hmsToDouble(DF$Time)<=hmsToDouble('00:06:00'),];
## DateTime XXX Time Date
## 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25
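The question also asks about specifying a starting point and a number of minutes. With the numeric encoding that is just an addition. The following is a sketch on made-up toy data; inWindow is a hypothetical helper, and it treats the window as half-open, [start, start + minutes), so that "00:04:00 and 3 minutes" selects 00:04, 00:05 and 00:06 at one-minute resolution. It does not handle windows that cross midnight.

```r
# Same parser as above: seconds since midnight from an "HH:MM:SS" string.
hmsToDouble <- function(hms) {
  as.double(substr(hms, 1, 2)) * 3600 +
    as.double(substr(hms, 4, 5)) * 60 +
    as.double(substr(hms, 7, 8))
}

# Hypothetical helper: TRUE for times in [start, start + minutes).
inWindow <- function(times, start, minutes) {
  secs <- hmsToDouble(times)
  lo <- hmsToDouble(start)
  secs >= lo & secs < lo + minutes * 60
}

# Toy data standing in for DF.
toy <- data.frame(
  Time = c("00:07:00", "00:06:00", "00:05:00", "00:04:00", "00:03:00"),
  XXX = c(13, 14, 3, 2, 4),
  stringsAsFactors = FALSE
)

picked <- toy[inWindow(toy$Time, "00:04:00", 3), ]
print(picked)
```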
Possibility 3: POSIXt
We can generate vectors of POSIXt (i.e. POSIXct or POSIXlt) values and use vectorized comparison against those vectors.
DF[DF$Date%in%DATES & DF$DateTime>=as.POSIXct(paste0(DF$Date,' 00:04:00')) & DF$DateTime<=as.POSIXct(paste0(DF$Date,' 00:06:00')),];
## DateTime XXX Time Date
## 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25
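The idea from the question of building a time sequence in a separate vector and matching against it can also be done with POSIXct, across many dates at once. This is a sketch on made-up toy data (toy, dates, starts and wanted are illustrative names, and UTC is an assumption to keep it deterministic). It relies on the timestamps falling exactly on whole minutes, since %in% matches exact values:

```r
# Toy data standing in for DF, with dates spread over several years.
toy <- data.frame(
  DateTime = as.POSIXct(c("2016-04-25 00:07:00", "2016-04-25 00:05:00",
                          "2014-03-11 00:04:00", "2014-03-11 00:03:00",
                          "2011-08-06 00:06:00"), tz = "UTC"),
  XXX = c(13, 3, 14, 9, 13)
)
dates <- as.Date(c("2016-04-25", "2014-03-11"))

# For a single day, seq() builds the minute-by-minute sequence directly.
starts <- as.POSIXct(paste(dates, "00:04:00"), tz = "UTC")
oneDay <- seq(starts[1], by = "min", length.out = 3)

# For all dates: each start instant plus offsets of 0, 1 and 2 minutes.
wanted <- rep(starts, each = 3) + rep(c(0, 1, 2) * 60, times = length(starts))

picked <- toy[toy$DateTime %in% wanted, ]
print(picked)
```

For long windows the range comparison shown above scales better than materializing every minute, but the sequence approach can be handy when the wanted instants are irregular.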
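Possibility 4: difftime
The only built-in duration data type in R is the difftime type, which can be a bit finicky to work with. For this problem, though, it is fairly straightforward.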
DF[DF$Date%in%DATES & as.difftime(DF$Time)>=as.difftime('00:04:00') & as.difftime(DF$Time)<=as.difftime('00:06:00'),];
## DateTime XXX Time Date
## 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25
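Since difftime values support addition, the "starting point plus a number of minutes" variant can be sketched like this. The helper toTime and the other names are illustrative, the format is passed explicitly rather than relying on the default, and the window is treated as half-open, which is an assumption about the intended semantics:

```r
# Parse "HH:MM:SS" strings into difftime values measured in minutes.
toTime <- function(x) as.difftime(x, format = "%H:%M:%S", units = "mins")

start <- toTime("00:04:00")
len <- as.difftime(3, units = "mins")   # window length
end <- start + len                      # difftime arithmetic

times <- c("00:07:00", "00:06:00", "00:05:00", "00:04:00", "00:03:00")
keep <- toTime(times) >= start & toTime(times) < end
print(times[keep])
```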
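Possibility 5: lubridate
The lubridate package is widely regarded as the best package for working with dates and times in R. It provides a duration type that represents regular spans of time, as well as a period type that represents counts of various irregular time units. Historically, date/time libraries have sometimes failed because they were not aware of the distinction between irregular and regular spans of time.
In the solution below, the hms() calls return instances of the period type, so we are actually comparing distinct time units. Incidentally, regarding the actual storage of the separate time units, lubridate's design stores the seconds value as the actual payload of the double vector, and the remaining units (minutes, hours, days, months, and years) as attributes on the object.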
library(lubridate);
DF[DF$Date%in%DATES & hms(DF$Time)>=hms('00:04:00') & hms(DF$Time)<=hms('00:06:00'),];
## DateTime XXX Time Date
## 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25