Delete rows that fall within a given time of each other and add info about the deleted rows in two new columns

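I have a data frame `df1` that summarizes detections of different animals over time. The `Rec` column specifies which device made the detection (`V4`, `V6`, etc.), and the `Ind` column specifies the individual.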

I want to delete the rows that meet this condition: "there is a detection for the same animal within the previous 55 seconds" (it does not matter whether that detection comes from a different receiver).
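To make the rule concrete, here is a minimal base R sketch for a single animal. The `times` vector is made up for illustration (seconds from an arbitrary origin), and the sketch assumes that every earlier detection counts, including ones that are themselves deleted.

```r
# Hypothetical detection times for one animal, in seconds.
times <- c(0, 33, 120, 126, 165, 400)

# TRUE when an earlier detection falls within the previous 55 seconds.
has_recent <- vapply(
  times,
  function(t) any(times < t & times >= t - 55),
  logical(1)
)

# Detections that survive the rule.
kept <- times[!has_recent]
```

Under that assumption the detection at 165 is dropped because of the one at 120, even though the detection at 126 in between is dropped as well.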

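In addition, I want to create these columns:

1) `Num_Rec`: the number of other `Rec` that detected the animal within the 55-second interval mentioned above.

2) `Which_Rec`: the names of the other `Rec` that detected the animal within that 55-second interval.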

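If the same animal is detected twice by the same `Rec` within a 55-second interval (i.e. rows 12 and 13 in `df1`), I consider the second row (= detection) to be an error (the same receiver cannot detect the same animal twice within 55 seconds), and I do not count that row in the `Num_Rec` and `Which_Rec` columns (i.e. in `Result`, `df1$DateTime[13]` is counted in neither `Result$Num_Rec[11]` nor `Result$Which_Rec[11]`).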
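As a sketch of how the two columns could be computed for one kept detection, the snippet below uses only the detections of individual 17 from rows 12, 13, 14 and 16 of `df1`. The names `det`, `start`, `start_rec`, `in_window` and `others` are made up for this illustration, and repeated hits from the same other receiver are not deduplicated.

```r
# Detections of individual 17 copied from rows 12, 13, 14 and 16 of df1.
det <- data.frame(
  DateTime = as.POSIXct(
    c("2016-08-01 17:27:16", "2016-08-01 17:27:22",
      "2016-08-01 17:28:01", "2016-08-01 17:28:08"),
    format = "%Y-%m-%d %H:%M:%S", tz = "UTC"
  ),
  Rec = c("V6", "V6", "V9", "V4"),
  stringsAsFactors = FALSE
)

start <- det$DateTime[1]   # the detection that is kept
start_rec <- det$Rec[1]    # its receiver

# Later detections within 55 seconds (adding a number to POSIXct adds seconds).
in_window <- det$DateTime > start & det$DateTime <= start + 55

# Ignore repeats on the same receiver, which are treated as errors.
others <- det$Rec[in_window & det$Rec != start_rec]

Num_Rec <- length(others)
Which_Rec <- if (Num_Rec > 0) paste(others, collapse = " ") else NA_character_
```

The second `V6` row falls inside the window but is excluded by the receiver check, so only `V9` and `V4` are counted.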

As an example:

    df1 <- data.frame(
      DateTime = c("2016-08-01 12:04:07", "2016-08-01 12:06:07", "2016-08-01 12:06:58", "2016-08-01 13:12:12",
                   "2016-08-01 14:04:07", "2016-08-01 13:12:45", "2016-08-01 15:04:07", "2016-08-01 17:13:16",
                   "2016-08-01 17:21:16", "2016-08-01 17:21:34", "2016-08-01 17:23:42", "2016-08-01 17:27:16",
                   "2016-08-01 17:27:22", "2016-08-01 17:28:01", "2016-08-01 17:29:28", "2016-08-01 17:28:08"),
      Rec = c("V6", "V7", "V6", "V6", "V7", "V7", "V6", "V7", "V7", "V7", "V6", "V6", "V6", "V9", "V7", "V4"),
      Ind = c(16, 17, 16, 16, 17, 16, 17, 16, 17, 16, 16, 17, 17, 17, 16, 17)
    )
    df1$DateTime <- as.POSIXct(df1$DateTime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

    df1
                  DateTime      Rec         Ind
    1  2016-08-01 12:04:07       V6          16
    2  2016-08-01 12:06:07       V7          17
    3  2016-08-01 12:06:58       V6          16
    4  2016-08-01 13:12:12       V6          16
    5  2016-08-01 14:04:07       V7          17
    6  2016-08-01 13:12:45       V7          16
    7  2016-08-01 15:04:07       V6          17
    8  2016-08-01 17:13:16       V7          16
    9  2016-08-01 17:21:16       V7          17
    10 2016-08-01 17:21:34       V7          16
    11 2016-08-01 17:23:42       V6          16
    12 2016-08-01 17:27:16       V6          17
    13 2016-08-01 17:27:22       V6          17
    14 2016-08-01 17:28:01       V9          17
    15 2016-08-01 17:29:28       V7          16
    16 2016-08-01 17:28:08       V4          17

What I would like to get is:

    Result
                  DateTime      Rec         Ind Num_Rec Which_Rec
    1  2016-08-01 12:04:07       V6          16       0        NA
    2  2016-08-01 12:06:07       V7          17       0        NA
    3  2016-08-01 12:06:58       V6          16       0        NA
    4  2016-08-01 13:12:12       V6          16       1        V7
    5  2016-08-01 14:04:07       V7          17       0        NA
    6  2016-08-01 15:04:07       V6          17       0        NA
    7  2016-08-01 17:13:16       V7          16       0        NA
    8  2016-08-01 17:21:16       V7          17       0        NA
    9  2016-08-01 17:21:34       V7          16       0        NA
    10 2016-08-01 17:23:42       V6          16       0        NA
    11 2016-08-01 17:27:16       V6          17       2     V9 V4
    12 2016-08-01 17:29:28       V7          16       0        NA

Note 1: In `Result[4,]` there is a detection of individual `16` at 13:12:12, and within the following 55 s there is one other detection (counted in `Num_Rec`) at `Rec` `V7` (named in `Which_Rec`).

Note 2: In `Result[11,]` there is a detection of individual `17` at 17:27:16 at `Rec` `V6`, and within the following 55 s there are two more TRUE detections, as indicated by the `2` in `Num_Rec`. `Which_Rec` gives the names of those receivers, in this case `V9` and `V4`. There is also a FALSE detection in the 55 s interval that starts at 17:27:16: row 13 in `df1` (it is a false detection because an animal can't be detected twice by the same `Rec` within 55 s).

I would like to know how to do this for a large data frame. I guess it is possible with the dplyr package, but I don't know how.

I tried the following, as proposed by a Stack Overflow colleague in an answer:

    library(tidyverse)

    df <- data.frame(DateTime=c("2016-08-01 12:04:07","2016-08-01 12:06:07","2016-08-01 12:06:58","2016-08-01 13:12:12","2016-08-01 14:04:07","2016-08-01 13:12:45","2016-08-01 15:04:07","2016-08-01 17:13:16","2016-08-01 17:21:16","2016-08-01 17:21:34","2016-08-01 17:23:42","2016-08-01 17:27:16","2016-08-01 17:27:22","2016-08-01 17:28:01","2016-08-01 17:29:28","2016-08-01 17:28:08"),Rec=c("V6", "V7", "V6", "V6", "V7", "V7", "V6", "V7", "V7","V7","V6","V6", "V6", "V9", "V7", "V4"),Ind=c(16, 17, 16, 16, 17, 16, 17, 16, 17, 16, 16, 17, 17, 17, 16, 17))%>%
      mutate(Rec = as.character(Rec),
             DateTime = as.POSIXct(as.character(DateTime))) %>% 
      as_tibble()

First I define a `delete_flag` by checking whether the same individual has been detected more than once within 55 seconds, and I filter the data accordingly.
Next I use `pmap` to get `Num_Rec` and `Which_Rec`:

    df %>% 
      mutate(delete_flag = map2_lgl(DateTime, Ind, ~filter(df, DateTime < .x, DateTime >= .x - 55, 
                                                           Ind == .y) %>% nrow %>% as.logical())) %>% 
      filter(!delete_flag) %>%
      select(-delete_flag) %>% 
      mutate(x = pmap(list(DateTime, Rec, Ind), ~filter(df, DateTime > ..1, DateTime <= ..1 +55,
                                             Rec != ..2, Ind == ..3) %>% 
                        summarise(Num_Rec = n(),
                                  Which_Rec = paste0(Rec, collapse = " ")))) %>% 
      unnest()

       DateTime            Rec     Ind Num_Rec Which_Rec
       <dttm>              <chr> <dbl>   <int> <chr>    
     1 2016-08-01 12:04:07 V6       16       0 ""       
     2 2016-08-01 12:06:07 V7       17       0 ""       
     3 2016-08-01 12:06:58 V6       16       0 ""       
     4 2016-08-01 13:12:12 V6       16       1 V7       
     5 2016-08-01 14:04:07 V7       17       0 ""       
     6 2016-08-01 15:04:07 V6       17       0 ""       
     7 2016-08-01 17:13:16 V7       16       0 ""       
     8 2016-08-01 17:21:16 V7       17       0 ""       
     9 2016-08-01 17:21:34 V7       16       0 ""       
    10 2016-08-01 17:23:42 V6       16       0 ""       
    11 2016-08-01 17:27:16 V6       17       2 V9 V4    
    12 2016-08-01 17:29:28 V7       16       0 "" 

But when I run the code shown above, what I get is different from what he got, and I don't know why:

    # A tibble: 12 x 5
       DateTime            Rec     Ind Num_Rec Which_Rec
       <dttm>              <chr> <dbl>   <int> <chr>
     1 2016-08-01 12:04:07 V6       16      12 ""
     2 2016-08-01 12:06:07 V7       17      12 ""
     3 2016-08-01 12:06:58 V6       16      12 ""
     4 2016-08-01 13:12:12 V6       16      12 V7
     5 2016-08-01 14:04:07 V7       17      12 ""
     6 2016-08-01 15:04:07 V6       17      12 ""
     7 2016-08-01 17:13:16 V7       16      12 ""
     8 2016-08-01 17:21:16 V7       17      12 ""
     9 2016-08-01 17:21:34 V7       16      12 ""
    10 2016-08-01 17:23:42 V6       16      12 ""
    11 2016-08-01 17:27:16 V6       17      12 V9 V4
    12 2016-08-01 17:29:28 V7       16      12 ""
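One thing that may be worth checking, although nothing in this post confirms it: a constant `Num_Rec` equal to the number of remaining rows (12) would be consistent with `summarise` or `n` not resolving to dplyr's versions, for example because another package that also exports `summarise` (such as plyr) was attached after dplyr. This is only an assumption about the session. A minimal sketch for seeing which package the bare names come from (the variable names `summarise_from` and `n_from` are made up here):

```r
library(dplyr)

# Name of the namespace that the bare names currently resolve to.
# In a session where only dplyr provides them, both should be "dplyr".
summarise_from <- environmentName(environment(summarise))
n_from <- environmentName(environment(n))

print(c(summarise = summarise_from, n = n_from))
```

If either name comes from a different package, writing `dplyr::summarise(...)` and `dplyr::n()` explicitly in the pipeline would be one way to rule this out.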

Here is a possible solution using `map2` and `pmap` from the purrr package.

First, this is the data I am working with:

    library(tidyverse)

    df <- data.frame(
      DateTime = c("2016-08-01 12:04:07", "2016-08-01 12:06:07", "2016-08-01 12:06:58", "2016-08-01 13:12:12",
                   "2016-08-01 14:04:07", "2016-08-01 13:12:45", "2016-08-01 15:04:07", "2016-08-01 17:13:16",
                   "2016-08-01 17:21:16", "2016-08-01 17:21:34", "2016-08-01 17:23:42", "2016-08-01 17:27:16",
                   "2016-08-01 17:27:22", "2016-08-01 17:28:01", "2016-08-01 17:29:28", "2016-08-01 17:28:08"),
      Rec = c("V6", "V7", "V6", "V6", "V7", "V7", "V6", "V7", "V7", "V7", "V6", "V6", "V6", "V9", "V7", "V4"),
      Ind = c(16, 17, 16, 16, 17, 16, 17, 16, 17, 16, 16, 17, 17, 17, 16, 17)
    ) %>%
      mutate(Rec = as.character(Rec),
             DateTime = as.POSIXct(as.character(DateTime))) %>%
      as_tibble()

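First I define a `delete_flag` by checking whether the same individual has been detected more than once within 55 seconds, and I filter the data accordingly. Next I use `pmap` to get `Num_Rec` and `Which_Rec`: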

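    df %>%
      # Flag rows that have an earlier detection of the same individual within 55 s.
      mutate(delete_flag = map2_lgl(DateTime, Ind, ~filter(df, DateTime < .x, DateTime >= .x - 55,
                                                           Ind == .y) %>% nrow %>% as.logical())) %>%
      filter(!delete_flag) %>%
      select(-delete_flag) %>%
      # For each kept row, summarise later detections within 55 s on other receivers.
      mutate(x = pmap(list(DateTime, Rec, Ind), ~filter(df, DateTime > ..1, DateTime <= ..1 + 55,
                                                        Rec != ..2, Ind == ..3) %>%
                        summarise(Num_Rec = n(),
                                  Which_Rec = paste0(Rec, collapse = " ")))) %>%
      # Name the list column explicitly; recent tidyr versions warn when it is omitted.
      unnest(x)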

       DateTime            Rec     Ind Num_Rec Which_Rec
       <dttm>              <chr> <dbl>   <int> <chr>
     1 2016-08-01 12:04:07 V6       16       0 ""
     2 2016-08-01 12:06:07 V7       17       0 ""
     3 2016-08-01 12:06:58 V6       16       0 ""
     4 2016-08-01 13:12:12 V6       16       1 V7
     5 2016-08-01 14:04:07 V7       17       0 ""
     6 2016-08-01 15:04:07 V6       17       0 ""
     7 2016-08-01 17:13:16 V7       16       0 ""
     8 2016-08-01 17:21:16 V7       17       0 ""
     9 2016-08-01 17:21:34 V7       16       0 ""
    10 2016-08-01 17:23:42 V6       16       0 ""
    11 2016-08-01 17:27:16 V6       17       2 V9 V4
    12 2016-08-01 17:29:28 V7       16       0 ""
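This output has `""` in `Which_Rec` where the desired `Result` has `NA`. If the missing value is wanted, one option is `dplyr::na_if`. A minimal sketch on a two-row stand-in (`res` is a made-up name, with values copied from rows 3 and 4 above):

```r
library(dplyr)

# Two rows shaped like the output above.
res <- tibble(
  Num_Rec = c(0L, 1L),
  Which_Rec = c("", "V7")
)

# Replace the empty string with a missing value.
res <- res %>% mutate(Which_Rec = na_if(Which_Rec, ""))
```

The same `mutate()` step could be appended to the end of the pipeline after the `unnest` call.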