Mutate new column based on sequence of time stamps in dbplyr

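I have a dataset from a PostgreSQL database with two columns:


id = participant ID

time_stamp = the timestamp at which the measurement was recorded


I need to use dbplyr to mutate a new column based on the sequence of time_stamp. In other words, if consecutive time_stamp values are sequential (meaning they are one minute apart), they are identified as a single event.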

For example, this is my dataset:

library(dplyr)
library(dbplyr)
library(lubridate)

mf <- memdb_frame(
  id = "id001", 
  time_stamp = c(
    seq(from = as_datetime("2021-01-01 08:00:00"), to = as_datetime("2021-01-01 08:03:00"), by = "1 min"),
    seq(from = as_datetime("2021-01-01 08:05:00"), to = as_datetime("2021-01-01 08:08:00"), by = "1 min"),
    seq(from = as_datetime("2021-01-01 08:12:00"), to = as_datetime("2021-01-01 08:18:00"), by = "1 min")
  )
)

mf %>% 
  collect() %>% 
  mutate(time_stamp = as_datetime(time_stamp))

#> # A tibble: 15 x 2
#>    id    time_stamp         
#>    <chr> <dttm>             
#>  1 id001 2021-01-01 08:00:00
#>  2 id001 2021-01-01 08:01:00
#>  3 id001 2021-01-01 08:02:00
#>  4 id001 2021-01-01 08:03:00
#>  5 id001 2021-01-01 08:05:00
#>  6 id001 2021-01-01 08:06:00
#>  7 id001 2021-01-01 08:07:00
#>  8 id001 2021-01-01 08:08:00
#>  9 id001 2021-01-01 08:12:00
#> 10 id001 2021-01-01 08:13:00
#> 11 id001 2021-01-01 08:14:00
#> 12 id001 2021-01-01 08:15:00
#> 13 id001 2021-01-01 08:16:00
#> 14 id001 2021-01-01 08:17:00
#> 15 id001 2021-01-01 08:18:00

Now I need to identify the events. That means finding the time_stamps that occur in sequence (sequence = 1-minute intervals). For example, this would be my expected output:

#> # A tibble: 15 x 3
#>    id    time_stamp          events 
#>    <chr> <dttm>              <chr>  
#>  1 id001 2021-01-01 08:00:00 event_1
#>  2 id001 2021-01-01 08:01:00 event_1
#>  3 id001 2021-01-01 08:02:00 event_1
#>  4 id001 2021-01-01 08:03:00 event_1
#>  5 id001 2021-01-01 08:05:00 event_2
#>  6 id001 2021-01-01 08:06:00 event_2
#>  7 id001 2021-01-01 08:07:00 event_2
#>  8 id001 2021-01-01 08:08:00 event_2
#>  9 id001 2021-01-01 08:12:00 event_3
#> 10 id001 2021-01-01 08:13:00 event_3
#> 11 id001 2021-01-01 08:14:00 event_3
#> 12 id001 2021-01-01 08:15:00 event_3
#> 13 id001 2021-01-01 08:16:00 event_3
#> 14 id001 2021-01-01 08:17:00 event_3
#> 15 id001 2021-01-01 08:18:00 event_3

Note that from row 4 to 5 there was an interval of 2 minutes – causing the next event to start. Same thing from row 8 to 9: there was an interval of 4 minutes and then the next event started.

PS: I need it to work fully in dbplyr, that is: without using collect()

Any ideas would be much appreciated!

Thanks!

Here is code that uses the logic "a 1-minute difference means the same event":

mf %>% 
  collect() %>% 
  mutate(time_stamp = as_datetime(time_stamp)) %>%
  mutate(diff_mins = difftime(time_stamp, lag(time_stamp, 1), units = "mins")) %>%
  mutate(event_count = if_else(diff_mins == 1 | is.na(diff_mins), 0, 1)) %>%
  mutate(event_index = paste0("event_", cumsum(event_count) + 1))

Output:

# A tibble: 15 x 5
   id    time_stamp          diff_mins event_count event_index
   <chr> <dttm>              <drtn>          <dbl> <chr>      
 1 id001 2021-01-01 08:00:00 NA mins             0 event_1    
 2 id001 2021-01-01 08:01:00  1 mins             0 event_1    
 3 id001 2021-01-01 08:02:00  1 mins             0 event_1    
 4 id001 2021-01-01 08:03:00  1 mins             0 event_1    
 5 id001 2021-01-01 08:05:00  2 mins             1 event_2    
 6 id001 2021-01-01 08:06:00  1 mins             0 event_2    
 7 id001 2021-01-01 08:07:00  1 mins             0 event_2    
 8 id001 2021-01-01 08:08:00  1 mins             0 event_2    
 9 id001 2021-01-01 08:12:00  4 mins             1 event_3    
10 id001 2021-01-01 08:13:00  1 mins             0 event_3    
11 id001 2021-01-01 08:14:00  1 mins             0 event_3    
12 id001 2021-01-01 08:15:00  1 mins             0 event_3    
13 id001 2021-01-01 08:16:00  1 mins             0 event_3    
14 id001 2021-01-01 08:17:00  1 mins             0 event_3    
15 id001 2021-01-01 08:18:00  1 mins             0 event_3    

Working without collect means we are mostly limited to dplyr functions, because these are the functions that have SQL translations defined.
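To check whether a given function has a translation, you can inspect the generated SQL without connecting to a database. The following is a minimal sketch, assuming a reasonably recent dbplyr; the one-row `lazy_frame()` is a hypothetical stand-in for the real table, and `simulate_postgres()` only mimics the PostgreSQL backend for translation purposes. `window_order()` is used here to set the ordering for the window function.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for the real table; no database connection is opened.
lf <- lazy_frame(
  id = "id001",
  time_stamp = as.POSIXct("2021-01-01 08:00:00", tz = "UTC"),
  con = simulate_postgres()
)

query <- lf %>%
  group_by(id) %>%
  window_order(time_stamp) %>%
  # lag() has a translation, so it becomes a SQL window function
  mutate(prev_time_stamp = lag(time_stamp, 1)) %>%
  # DATE_PART is not known to dbplyr, so it is passed through unchanged
  mutate(min_part_diff = DATE_PART('minute', time_stamp - prev_time_stamp))

# Render the SQL as a plain string and print it
translated <- as.character(sql_render(query))
cat(translated)
```

In the printed SQL, `lag()` should appear as a `LAG(...) OVER (PARTITION BY ... ORDER BY ...)` expression, while `DATE_PART` appears exactly as written.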

Much like @Sinh_Nguyen, I would suggest the following:

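output = mf %>%
  group_by(id) %>%
  arrange(time_stamp) %>%
  mutate(prev_time_stamp = lag(time_stamp, 1)) %>%
  # DATE_PART is the PostgreSQL function; it is passed through untranslated.
  # Only the hour and minute fields are used, so day and second parts are ignored.
  mutate(hours_diff = DATE_PART('hour', time_stamp - prev_time_stamp),
         min_part_diff = DATE_PART('minute', time_stamp - prev_time_stamp)) %>%
  mutate(gap = hours_diff * 60 + min_part_diff) %>%
  # 0 = continuation of the previous event (or first record), 1 = start of a new event
  mutate(is_gap = ifelse(is.na(prev_time_stamp) | gap == 1, 0, 1)) %>%
  # counter starts at 0; add 1 if you want numbering to match event_1, event_2, ...
  mutate(event_index = cumsum(is_gap))

Notes:

  • group_by and arrange occur once at the start, but are used implicitly by the lag and cumsum functions.
  • If dbplyr has no translation defined for a function, it passes the command through as-is. Writing DATE_PART in capitals ensures it is not translated, so we get the PostgreSQL DATE_PART function. I mostly use SQL Server, so I followed these examples for how to calculate differences in PostgreSQL.
  • The idea of the last two lines is to create a binary indicator that is 1 whenever a record is not a continuation of the previous event. Taking the cumulative sum of these indicators increments the event counter at each new event.
  • If you get errors, use show_query(output) to view the SQL translation. Viewing/sharing the SQL translation usually helps with troubleshooting.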
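The binary-indicator idea can also be tried on the in-memory SQLite table that `memdb_frame()` creates, where the PostgreSQL `DATE_PART` function is not available. This is a sketch under stated assumptions rather than a drop-in solution: it assumes the RSQLite package is installed, stores the timestamps as epoch seconds (so a one-minute step is a difference of 60), and uses `window_order()` to set the ordering for the window functions. The names `mf_epoch`, `events` and `result` are made up for illustration.

```r
library(dplyr)
library(dbplyr)

# Same timestamps as in the question
ts <- c(
  seq(as.POSIXct("2021-01-01 08:00:00", tz = "UTC"),
      as.POSIXct("2021-01-01 08:03:00", tz = "UTC"), by = "1 min"),
  seq(as.POSIXct("2021-01-01 08:05:00", tz = "UTC"),
      as.POSIXct("2021-01-01 08:08:00", tz = "UTC"), by = "1 min"),
  seq(as.POSIXct("2021-01-01 08:12:00", tz = "UTC"),
      as.POSIXct("2021-01-01 08:18:00", tz = "UTC"), by = "1 min")
)

# Assumption: timestamps are stored as numeric epoch seconds
mf_epoch <- memdb_frame(id = "id001", time_stamp = as.numeric(ts))

events <- mf_epoch %>%
  group_by(id) %>%
  window_order(time_stamp) %>%
  mutate(prev_time_stamp = lag(time_stamp, 1)) %>%
  # 0 = first record or exactly 60 seconds after the previous one, 1 = new event
  mutate(is_gap = ifelse(
    is.na(prev_time_stamp) | (time_stamp - prev_time_stamp) == 60, 0L, 1L
  )) %>%
  # running total of new-event flags, shifted so numbering starts at 1
  mutate(event_index = cumsum(is_gap) + 1L) %>%
  ungroup()

# Everything above is still a lazy query; inspect the SQL it generates
show_query(events)

# collect() is used here only to look at the result locally
result <- events %>%
  arrange(id, time_stamp) %>%
  collect()
```

On PostgreSQL the comparison against 60 would need to be replaced by an interval-based expression such as the `DATE_PART` approach above, since subtracting two timestamps there yields an interval rather than a number.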