Common periods between all data sets where there are no NAs
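I have 3 data frames of values measured every 10 seconds over 6 months. I want to compare them, but they contain many gaps of missing values at different times within those 6 months.
I am trying to find a way to compare the 3 data frames so as to **find the common periods in which none of the 3 data frames has missing values.** In other words, I want to know the exact dates and times at which data are available in all data frames, so that I can extract those data and continue my analysis.
For example, here is some input data: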
#df1
date V1
2010-02-01 00:00:00 15278
2010-02-01 00:00:10 15257
2010-02-01 00:00:20 15273
2010-02-01 00:00:30 15386
2010-02-01 00:00:40 15333
2010-02-01 00:00:50 15360
2010-02-01 00:01:00 17357
2010-02-01 00:01:10 na
2010-02-01 00:01:20 na
2010-02-01 00:01:30 na
2010-02-01 00:01:40 na
2010-02-01 00:01:50 14214
2010-02-01 00:02:00 na
2010-02-01 00:02:10 14233
2010-02-01 00:02:20 14183
2010-02-01 00:02:30 14100
2010-02-01 00:02:40 14070
2010-02-01 00:02:50 na
...
and df2
#df2
date V2
2010-02-01 00:00:00 15
2010-02-01 00:00:10 12
2010-02-01 00:00:20 13
2010-02-01 00:00:30 16
2010-02-01 00:00:40 13
2010-02-01 00:00:50 15
2010-02-01 00:01:00 17
2010-02-01 00:01:10 na
2010-02-01 00:01:20 na
2010-02-01 00:01:30 na
2010-02-01 00:01:40 na
2010-02-01 00:01:50 16
2010-02-01 00:02:00 na
2010-02-01 00:02:10 14
2010-02-01 00:02:20 11
2010-02-01 00:02:30 10
2010-02-01 00:02:40 13
2010-02-01 00:02:50 17
...
and df3
#df3
date V3
2010-02-01 00:00:00 11278
2010-02-01 00:00:10 11257
2010-02-01 00:00:20 11273
2010-02-01 00:00:30 12386
2010-02-01 00:00:40 13333
2010-02-01 00:00:50 na
2010-02-01 00:01:00 11357
2010-02-01 00:01:10 na
2010-02-01 00:01:20 na
2010-02-01 00:01:30 na
2010-02-01 00:01:40 na
2010-02-01 00:01:50 12542
2010-02-01 00:02:00 na
2010-02-01 00:02:10 na
2010-02-01 00:02:20 13183
2010-02-01 00:02:30 14100
2010-02-01 00:02:40 18850
2010-02-01 00:02:50 14770
...
The expected output is:
2010-02-01 00:00:00 to 2010-02-01 00:00:40
2010-02-01 00:01:00 to 2010-02-01 00:01:00 # as data are available at this time in all data frames
2010-02-01 00:01:50 to 2010-02-01 00:01:50 # as data are available at this time in all data frames
2010-02-01 00:02:20 to 2010-02-01 00:02:40
I think you can do this with the steps below. First, here are the data in a reproducible format.
df1 <- tibble::tribble(
~date, ~V1,
"2010-02-01 00:00:00", 15278,
"2010-02-01 00:00:10", 15257,
"2010-02-01 00:00:20", 15273,
"2010-02-01 00:00:30", 15386,
"2010-02-01 00:00:40", 15333,
"2010-02-01 00:00:50", 15360,
"2010-02-01 00:01:00", 17357,
"2010-02-01 00:01:10", NA,
"2010-02-01 00:01:20", NA,
"2010-02-01 00:01:30", NA,
"2010-02-01 00:01:40", NA,
"2010-02-01 00:01:50", 14214,
"2010-02-01 00:02:00", NA,
"2010-02-01 00:02:10", 14233,
"2010-02-01 00:02:20", 14183,
"2010-02-01 00:02:30", 14100,
"2010-02-01 00:02:40", 14070,
"2010-02-01 00:02:50", NA)
df2 <- tibble::tribble(
~date, ~V2,
"2010-02-01 00:00:00", 15,
"2010-02-01 00:00:10", 12,
"2010-02-01 00:00:20", 13,
"2010-02-01 00:00:30", 16,
"2010-02-01 00:00:40", 13,
"2010-02-01 00:00:50", 15,
"2010-02-01 00:01:00", 17,
"2010-02-01 00:01:10", NA,
"2010-02-01 00:01:20", NA,
"2010-02-01 00:01:30", NA,
"2010-02-01 00:01:40", NA,
"2010-02-01 00:01:50", 16,
"2010-02-01 00:02:00", NA,
"2010-02-01 00:02:10", 14,
"2010-02-01 00:02:20", 11,
"2010-02-01 00:02:30", 10,
"2010-02-01 00:02:40", 13,
"2010-02-01 00:02:50", 17)
df3 <- tibble::tribble(
~date, ~ V3,
"2010-02-01 00:00:00", 11278,
"2010-02-01 00:00:10", 11257,
"2010-02-01 00:00:20", 11273,
"2010-02-01 00:00:30", 12386,
"2010-02-01 00:00:40", 13333,
"2010-02-01 00:00:50", NA,
"2010-02-01 00:01:00", 11357,
"2010-02-01 00:01:10", NA,
"2010-02-01 00:01:20", NA,
"2010-02-01 00:01:30", NA,
"2010-02-01 00:01:40", NA,
"2010-02-01 00:01:50", 12542,
"2010-02-01 00:02:00", NA,
"2010-02-01 00:02:10", NA,
"2010-02-01 00:02:20", 13183,
"2010-02-01 00:02:30", 14100,
"2010-02-01 00:02:40", 18850,
"2010-02-01 00:02:50", 14770)
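First, make sure the dates are stored as proper date-time values.
library(dplyr)  # provides %>%, mutate(), lag(), inner_join(), group_by(), summarise()
df1 <- df1 %>% mutate(date = lubridate::ymd_hms(date))
df2 <- df2 %>% mutate(date = lubridate::ymd_hms(date))
df3 <- df3 %>% mutate(date = lubridate::ymd_hms(date))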
Save the original data frames for later use:
df1_orig <- df1
df2_orig <- df2
df3_orig <- df3
Then apply listwise deletion to each data frame, dropping every row that contains an NA:
df1 <- na.omit(df1)
df2 <- na.omit(df2)
df3 <- na.omit(df3)
Next, you need inner_join(), because it keeps only the observations common to both data sets.
df_all <- inner_join(df1, df2, by = "date")
df_all <- inner_join(df_all, df3, by = "date")
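If more than three data frames are involved, the pairwise joins above can be folded over a list. The following is only a minimal base R sketch of that idea (merge() with its default all = FALSE behaves as an inner join); the toy frames a, b and c3 are hypothetical and are not the data from the question.

```r
# Hypothetical toy frames sharing a "date" key column
a  <- data.frame(date = c("2010-02-01 00:00:00", "2010-02-01 00:00:10",
                          "2010-02-01 00:00:20"),
                 V1 = c(1, NA, 3))
b  <- data.frame(date = c("2010-02-01 00:00:00", "2010-02-01 00:00:10",
                          "2010-02-01 00:00:20"),
                 V2 = c(4, 5, 6))
c3 <- data.frame(date = c("2010-02-01 00:00:00", "2010-02-01 00:00:10",
                          "2010-02-01 00:00:20"),
                 V3 = c(NA, 8, 9))

# Listwise deletion on every frame, then fold an inner join over the list
frames <- lapply(list(a, b, c3), na.omit)
common <- Reduce(function(x, y) merge(x, y, by = "date"), frames)
```

Here common keeps only the timestamps that are complete in every frame, which is the same role df_all plays in the answer.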
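Now df_all contains only the complete observations common to all three data sets. You can then take the lag of date (the previous observation) and check whether it is exactly 10 seconds before the current observation. If it is, cont is 0; if the gap is longer than 10 seconds, cont is 1. The first row has no previous observation, so its cont is set to 1. Taking the cumulative sum of cont then labels each run of consecutive observations with its own group number.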
df_all <- df_all %>%
mutate(lag_date = lag(date),
cont = as.numeric(lag_date != (date - lubridate::hms("00:00:10"))),
cont = ifelse(is.na(cont), 1, cont),
group = cumsum(cont))
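The same run-detection idea can be looked at in isolation on a plain vector. This is a minimal base R sketch, not the answer's dplyr code; the times vector is a hypothetical example and UTC is an assumed time zone.

```r
# Hypothetical timestamps that survived the join (UTC assumed)
times <- as.POSIXct(c("2010-02-01 00:00:00", "2010-02-01 00:00:10",
                      "2010-02-01 00:00:20", "2010-02-01 00:01:00",
                      "2010-02-01 00:01:50", "2010-02-01 00:02:00"),
                    tz = "UTC")

# Gap in seconds to the previous timestamp; the first element has no predecessor
gap <- c(NA, diff(as.numeric(times)))

# A new run starts at the first element or wherever the gap is not exactly 10 s
new_run <- is.na(gap) | gap != 10

# The cumulative sum turns the start flags into run identifiers
group <- cumsum(new_run)
```

Each TRUE in new_run bumps the counter, so every element of a contiguous 10-second run shares one identifier.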
Finally, group by the group variable and find the minimum and maximum of date within each group.
res <- df_all %>% group_by(group) %>%
summarise(start = min(date), end = max(date))
res
#
# # A tibble: 4 x 3
# group start end
# * <dbl> <dttm> <dttm>
# 1 1 2010-02-01 00:00:00 2010-02-01 00:00:40
# 2 2 2010-02-01 00:01:00 2010-02-01 00:01:00
# 3 3 2010-02-01 00:01:50 2010-02-01 00:01:50
# 4 4 2010-02-01 00:02:20 2010-02-01 00:02:40
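To print the periods in the "start to end" form shown in the question, the two columns can be pasted together. A minimal base R sketch, assuming a result table shaped like res; the periods_tbl data frame below is a hypothetical stand-in with two of the rows above, and UTC is an assumed time zone. The format string is given explicitly because format() on date-times can drop the time part when every value falls on midnight.

```r
# Hypothetical stand-in for `res` (UTC assumed)
periods_tbl <- data.frame(
  start = as.POSIXct(c("2010-02-01 00:00:00", "2010-02-01 00:01:00"), tz = "UTC"),
  end   = as.POSIXct(c("2010-02-01 00:00:40", "2010-02-01 00:01:00"), tz = "UTC")
)

# Explicit format so that seconds are always printed
fmt <- "%Y-%m-%d %H:%M:%S"
periods <- paste(format(periods_tbl$start, fmt), "to",
                 format(periods_tbl$end, fmt))
```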
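I know you have a lot of data, so hopefully this will be fast enough. In my experience, dplyr functions seem to scale better than their base R counterparts, so hopefully that holds here too.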
Edit: filtering the original data
To filter the original data so that they contain only these times, you can do the following:
keep_times <- res %>%
  rowwise() %>%
  # one 10-second sequence per period, stored as a list column
  mutate(date = list(seq(from = start, to = end, by = "10 sec"))) %>%
  tidyr::unnest(date) %>%
  ungroup() %>%
  select(date)
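The code above generates, for each row, a sequence from the start time to the end time in 10-second steps. It then unnests the list column and keeps only the resulting sequence of times. You can then left_join this to the original data:
d1 <- left_join(keep_times, df1_orig, by = "date")
d2 <- left_join(keep_times, df2_orig, by = "date")
d3 <- left_join(keep_times, df3_orig, by = "date")
The results are: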
d1
# # A tibble: 10 x 2
# date V1
# <dttm> <dbl>
# 1 2010-02-01 00:00:00 15278
# 2 2010-02-01 00:00:10 15257
# 3 2010-02-01 00:00:20 15273
# 4 2010-02-01 00:00:30 15386
# 5 2010-02-01 00:00:40 15333
# 6 2010-02-01 00:01:00 17357
# 7 2010-02-01 00:01:50 14214
# 8 2010-02-01 00:02:20 14183
# 9 2010-02-01 00:02:30 14100
# 10 2010-02-01 00:02:40 14070
d2
# # A tibble: 10 x 2
# date V2
# <dttm> <dbl>
# 1 2010-02-01 00:00:00 15
# 2 2010-02-01 00:00:10 12
# 3 2010-02-01 00:00:20 13
# 4 2010-02-01 00:00:30 16
# 5 2010-02-01 00:00:40 13
# 6 2010-02-01 00:01:00 17
# 7 2010-02-01 00:01:50 16
# 8 2010-02-01 00:02:20 11
# 9 2010-02-01 00:02:30 10
# 10 2010-02-01 00:02:40 13
d3
# # A tibble: 10 x 2
# date V3
# <dttm> <dbl>
# 1 2010-02-01 00:00:00 11278
# 2 2010-02-01 00:00:10 11257
# 3 2010-02-01 00:00:20 11273
# 4 2010-02-01 00:00:30 12386
# 5 2010-02-01 00:00:40 13333
# 6 2010-02-01 00:01:00 11357
# 7 2010-02-01 00:01:50 12542
# 8 2010-02-01 00:02:20 13183
# 9 2010-02-01 00:02:30 14100
# 10 2010-02-01 00:02:40 18850
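Since the joined table already holds exactly the timestamps present in every data frame, the same filtering can probably be done without rebuilding the 10-second sequences, by keeping the original rows whose date appears among the common timestamps. A minimal base R sketch, where orig and common_times are hypothetical stand-ins for df1_orig and df_all$date (UTC assumed):

```r
# Hypothetical stand-in for one original data frame (UTC assumed)
orig <- data.frame(
  date = as.POSIXct(c("2010-02-01 00:00:00", "2010-02-01 00:00:10",
                      "2010-02-01 00:00:20"), tz = "UTC"),
  V1 = c(15278, NA, 15273)
)

# Hypothetical stand-in for the timestamps common to all frames
common_times <- as.POSIXct(c("2010-02-01 00:00:00", "2010-02-01 00:00:20"),
                           tz = "UTC")

# Keep only the rows whose timestamp is in the common set
kept <- orig[orig$date %in% common_times, ]
```

This relies on the timestamps matching exactly, which holds when they all come from the same parsed column.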