How to create a unique index based on chunks of time
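I'd like to create a unique index or ID for chunks of time in R.
I have a large dataset of time data recorded at one-second resolution. There are breaks in the time that would, in theory, let me "group" the chunks of time and assign each one a unique index or number.
I'll try to create a reproducible example, but keep in mind that the duration of each chunk varies, the gaps between chunks are not evenly spaced, and the date may roll over from one day to the next.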
#this is what the dataframe would look like
DateTime
2021-07-12 20:28:26 CDT
2021-07-12 20:28:27 CDT
2021-07-12 20:28:28 CDT
2021-07-12 20:28:29 CDT
2021-07-12 20:28:30 CDT
2021-07-12 23:14:28 CDT
2021-07-12 23:14:29 CDT
2021-07-12 23:14:30 CDT
2021-07-12 23:14:31 CDT
2021-07-12 23:14:32 CDT
2021-07-12 23:14:33 CDT
2021-07-12 23:14:34 CDT
2021-07-12 23:14:35 CDT
2021-07-12 23:14:36 CDT
2021-07-27 17:16:05 CDT
2021-07-27 17:16:06 CDT
2021-07-27 17:16:07 CDT
2021-07-27 17:16:08 CDT
2021-07-27 17:16:09 CDT
2021-07-27 17:16:10 CDT
2021-07-27 17:16:11 CDT
2021-07-27 17:16:12 CDT
2021-07-27 17:16:13 CDT
2021-07-27 17:16:14 CDT
2021-07-27 17:16:15 CDT
#this is for reproducing the times
structure(c(1626139706, 1626139707, 1626139708, 1626139709, 1626139710, 1626149668, 1626149669, 1626149670, 1626149671, 1626149672, 1626149673, 1626149674, 1626149675, 1626149676, 1627424165, 1627424166, 1627424167, 1627424168, 1627424169, 1627424170, 1627424171, 1627424172, 1627424173, 1627424174, 1627424175),
class = c("POSIXct", "POSIXt"), tzone = "")
Again, I'd like to assign a unique number to each section/chunk of time. It would look like this:
DateTime Index
2021-07-12 20:28:26 CDT 1
2021-07-12 20:28:27 CDT 1
2021-07-12 20:28:28 CDT 1
2021-07-12 20:28:29 CDT 1
2021-07-12 20:28:30 CDT 1
2021-07-12 23:14:28 CDT 2
2021-07-12 23:14:29 CDT 2
2021-07-12 23:14:30 CDT 2
2021-07-12 23:14:31 CDT 2
2021-07-12 23:14:32 CDT 2
2021-07-12 23:14:33 CDT 2
2021-07-12 23:14:34 CDT 2
2021-07-12 23:14:35 CDT 2
2021-07-12 23:14:36 CDT 2
2021-07-27 17:16:05 CDT 3
2021-07-27 17:16:06 CDT 3
2021-07-27 17:16:07 CDT 3
2021-07-27 17:16:08 CDT 3
2021-07-27 17:16:09 CDT 3
2021-07-27 17:16:10 CDT 3
2021-07-27 17:16:11 CDT 3
2021-07-27 17:16:12 CDT 3
2021-07-27 17:16:13 CDT 3
2021-07-27 17:16:14 CDT 3
2021-07-27 17:16:15 CDT 3
#edit: something like this is a possibility but isn't included in the reproducible example.
DateTime Index
2021-07-15 23:59:59 CDT 4
2021-07-16 00:00:00 CDT 4
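This is the closest thing I have found to what I'm looking for:
But I'm not sure how to proceed. Any help would be greatly appreciated. Thanks.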
library(dplyr)
data.frame(DateTime) %>%
  # start a new group whenever the gap to the previous row exceeds 60 seconds
  mutate(Index = 1 + cumsum(DateTime - lag(DateTime, 1, min(DateTime)) > 60))
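This creates a new group whenever there is a break of more than 1 minute. Date-times are stored "under the hood" as seconds, so a difference of 60 from the previous ('lag') value is one minute. cumsum captures the cumulative number of times such a large break has occurred.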
DateTime Index
1 2021-07-12 18:28:26 1
2 2021-07-12 18:28:27 1
3 2021-07-12 18:28:28 1
4 2021-07-12 18:28:29 1
5 2021-07-12 18:28:30 1
6 2021-07-12 21:14:28 2
7 2021-07-12 21:14:29 2
8 2021-07-12 21:14:30 2
9 2021-07-12 21:14:31 2
10 2021-07-12 21:14:32 2
11 2021-07-12 21:14:33 2
12 2021-07-12 21:14:34 2
13 2021-07-12 21:14:35 2
14 2021-07-12 21:14:36 2
15 2021-07-27 15:16:05 3
16 2021-07-27 15:16:06 3
17 2021-07-27 15:16:07 3
18 2021-07-27 15:16:08 3
19 2021-07-27 15:16:09 3
20 2021-07-27 15:16:10 3
21 2021-07-27 15:16:11 3
22 2021-07-27 15:16:12 3
23 2021-07-27 15:16:13 3
24 2021-07-27 15:16:14 3
25 2021-07-27 15:16:15 3
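The same gap-and-cumsum idea can be sketched in base R. This is only an illustration, not the answer's own code: `chunk_index` and its `gap_secs` argument are hypothetical names, the input is assumed to be sorted, and only five of the sample timestamps are used. Asking `difftime()` for seconds explicitly avoids relying on the units R picks automatically when subtracting date-times.

```r
# Hypothetical helper: label runs of timestamps separated by more than `gap_secs`.
# Assumes `x` is a POSIXct vector already sorted in ascending order.
chunk_index <- function(x, gap_secs = 60) {
  # Differences between neighbours, forced to seconds.
  gaps <- as.numeric(difftime(x[-1], x[-length(x)], units = "secs"))
  # The first row belongs to chunk 1; every large gap starts a new chunk.
  1L + cumsum(c(FALSE, gaps > gap_secs))
}

# A shortened version of the sample data (epoch seconds taken from the question).
DateTime <- as.POSIXct(
  c(1626139706, 1626139707, 1626149668, 1626149669, 1627424165),
  origin = "1970-01-01", tz = "UTC"
)

idx <- chunk_index(DateTime)
print(data.frame(DateTime, Index = idx))
```

Changing `gap_secs` moves the threshold, for example `chunk_index(DateTime, gap_secs = 1)` would treat any skipped second as a break.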
If we are looking for an index that increases by 1 at each change of minute, floor_date can be used:
library(lubridate)
library(tibble)
library(dplyr)
tibble(DateTime) %>%
  mutate(Index = floor_date(DateTime, unit = 'minute'),
         Index = match(Index, unique(Index)))
-output
# A tibble: 25 × 2
DateTime Index
<dttm> <int>
1 2021-07-12 21:28:26 1
2 2021-07-12 21:28:27 1
3 2021-07-12 21:28:28 1
4 2021-07-12 21:28:29 1
5 2021-07-12 21:28:30 1
6 2021-07-13 00:14:28 2
7 2021-07-13 00:14:29 2
8 2021-07-13 00:14:30 2
9 2021-07-13 00:14:31 2
10 2021-07-13 00:14:32 2
# … with 15 more rows
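One thing worth checking before choosing between the two ideas is the midnight case from the question's edit. The base R sketch below is an assumption-laden illustration: it bins epoch seconds with `%/% 60` as a stand-in for flooring to the minute, and the two timestamps are the hypothetical ones from the edit, parsed as UTC. It suggests that binning by calendar minute splits two consecutive seconds that straddle a minute boundary, while the gap-based index keeps them together.

```r
# Two consecutive seconds that straddle a minute (and day) boundary.
x <- as.POSIXct(c("2021-07-15 23:59:59", "2021-07-16 00:00:00"), tz = "UTC")

# Minute binning: integer-divide epoch seconds by 60, then number the bins
# in order of first appearance.
minute_bin <- as.numeric(x) %/% 60
by_minute <- match(minute_bin, unique(minute_bin))

# Gap-based labelling: a new chunk starts only when the jump exceeds 60 seconds.
by_gap <- 1L + cumsum(c(FALSE, diff(as.numeric(x)) > 60))

print(by_minute)
print(by_gap)
```

If chunks in the real data can cross a minute boundary, the gap-based index is probably closer to the output the question asks for.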
Here is another approach:
library(dplyr)
tibble(DateTime) %>%
mutate(DateTime1 = lag(DateTime, default = DateTime[1])) %>%
mutate(helper = DateTime - DateTime1) %>%
group_by(Index = cumsum(helper!=1)) %>%
select(-DateTime1, -helper)
Data:
DateTime <- structure(c(1626139706, 1626139707, 1626139708, 1626139709, 1626139710, 1626149668, 1626149669, 1626149670, 1626149671, 1626149672, 1626149673, 1626149674, 1626149675, 1626149676, 1627424165, 1627424166, 1627424167, 1627424168, 1627424169, 1627424170, 1627424171, 1627424172, 1627424173, 1627424174, 1627424175),
class = c("POSIXct", "POSIXt"), tzone = "")
Output:
DateTime Index
<dttm> <int>
1 2021-07-13 03:28:26 1
2 2021-07-13 03:28:27 1
3 2021-07-13 03:28:28 1
4 2021-07-13 03:28:29 1
5 2021-07-13 03:28:30 1
6 2021-07-13 06:14:28 2
7 2021-07-13 06:14:29 2
8 2021-07-13 06:14:30 2
9 2021-07-13 06:14:31 2
10 2021-07-13 06:14:32 2
11 2021-07-13 06:14:33 2
12 2021-07-13 06:14:34 2
13 2021-07-13 06:14:35 2
14 2021-07-13 06:14:36 2
15 2021-07-28 00:16:05 3
16 2021-07-28 00:16:06 3
17 2021-07-28 00:16:07 3
18 2021-07-28 00:16:08 3
19 2021-07-28 00:16:09 3
20 2021-07-28 00:16:10 3
21 2021-07-28 00:16:11 3
22 2021-07-28 00:16:12 3
23 2021-07-28 00:16:13 3
24 2021-07-28 00:16:14 3
25 2021-07-28 00:16:15 3
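Note that this approach starts a new group on any step that is not exactly one second, which is stricter than a 60-second threshold. A small base R sketch of the difference, using made-up offsets with one skipped second (the offsets and the variable names `strict` and `lenient` are illustrative assumptions, not part of the answers):

```r
# Offsets in seconds from the first sample timestamp; second 3 is missing.
secs <- c(0, 1, 2, 4, 5)
x <- as.POSIXct(1626139706 + secs, origin = "1970-01-01", tz = "UTC")

# Step between neighbouring rows, in seconds.
step <- diff(as.numeric(x))

# Any step other than exactly 1 second starts a new chunk.
strict <- cumsum(c(TRUE, step != 1))

# Only gaps longer than 60 seconds start a new chunk.
lenient <- 1L + cumsum(c(FALSE, step > 60))

print(strict)
print(lenient)
```

Which behaviour is wanted depends on whether a single dropped or duplicated second should count as a break in the data.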