Delete rows that are closer to each other than a specific time and add info about the deleted rows in two new columns
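I have a data frame `df1` that summarizes detections of different animals over time. The `Rec` column specifies which receiver detected the animal (`V4`, `V6`, etc.), and the `Ind` column identifies the individual.
I want to delete the rows that meet this condition: "there is a detection for the same animal within the previous 55 seconds" (it does not matter whether the detection comes from a different receiver).
In addition, I want to create these columns:
1) `Num_Rec`: the number of other `Rec` that detected the animal within the 55-second interval.
2) `Which_Rec`: the names of the other `Rec` that detected the animal within the 55-second interval.
If the same animal is caught twice by the same `Rec` within the 55-second interval (i.e. rows 12 and 13 in `df1`), I consider the second detection an error (the same receiver cannot detect the same animal twice within 55 seconds) and I do not count that row in the `Num_Rec` and `Which_Rec` columns (i.e. `df1$DateTime[13]` is counted in neither `Result$Num_Rec[11]` nor `Result$Which_Rec[11]`).
As an example: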
df1<-data.frame(DateTime=c("2016-08-01 12:04:07","2016-08-01 12:06:07","2016-08-01 12:06:58","2016-08-01 13:12:12","2016-08-01 14:04:07","2016-08-01 13:12:45","2016-08-01 15:04:07","2016-08-01 17:13:16","2016-08-01 17:21:16","2016-08-01 17:21:34","2016-08-01 17:23:42","2016-08-01 17:27:16","2016-08-01 17:27:22","2016-08-01 17:28:01","2016-08-01 17:29:28","2016-08-01 17:28:08"),Rec=c("V6", "V7", "V6", "V6", "V7", "V7", "V6", "V7", "V7","V7","V6","V6", "V6", "V9", "V7", "V4"),Ind=c(16, 17, 16, 16, 17, 16, 17, 16, 17, 16, 16, 17, 17, 17, 16, 17))
df1$DateTime<- as.POSIXct(df1$DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC")
df1
DateTime Rec Ind
1 2016-08-01 12:04:07 V6 16
2 2016-08-01 12:06:07 V7 17
3 2016-08-01 12:06:58 V6 16
4 2016-08-01 13:12:12 V6 16
5 2016-08-01 14:04:07 V7 17
6 2016-08-01 13:12:45 V7 16
7 2016-08-01 15:04:07 V6 17
8 2016-08-01 17:13:16 V7 16
9 2016-08-01 17:21:16 V7 17
10 2016-08-01 17:21:34 V7 16
11 2016-08-01 17:23:42 V6 16
12 2016-08-01 17:27:16 V6 17
13 2016-08-01 17:27:22 V6 17
14 2016-08-01 17:28:01 V9 17
15 2016-08-01 17:29:28 V7 16
16 2016-08-01 17:28:08 V4 17
What I would like to get is:
Result
DateTime Rec Ind Num_Rec Which_Rec
1 2016-08-01 12:04:07 V6 16 0 NA
2 2016-08-01 12:06:07 V7 17 0 NA
3 2016-08-01 12:06:58 V6 16 0 NA
4 2016-08-01 13:12:12 V6 16 1 V7
5 2016-08-01 14:04:07 V7 17 0 NA
6 2016-08-01 15:04:07 V6 17 0 NA
7 2016-08-01 17:13:16 V7 16 0 NA
8 2016-08-01 17:21:16 V7 17 0 NA
9 2016-08-01 17:21:34 V7 16 0 NA
10 2016-08-01 17:23:42 V6 16 0 NA
11 2016-08-01 17:27:16 V6 17 2 V9 V4
12 2016-08-01 17:29:28 V7 16 0 NA
Note 1: In `Result[4,]`, individual `16` is detected at 13:12:12, and within the following 55 s there is one other detection (counted in `Num_Rec`) by `Rec` `V7` (named in `Which_Rec`).
Note 2: In `Result[11,]`, individual `17` is detected at 17:27:16 by `Rec` `V6`, and within the following 55 s there are two more TRUE detections, which is why `Num_Rec` is `2`. `Which_Rec` gives the names of those receivers, in this case `V9` and `V4`. There is also a FALSE detection in the 55 s interval that starts at 17:27:16: row 13 in `df1` (it is false because an animal cannot be detected twice by the same `Rec` within 55 s).
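The arithmetic behind Note 2 can be checked with a small base R sketch. The variable names (`t0`, `later`, `rec`, `gap`, `counted`) are made up for this illustration; the timestamps and receivers are rows 12, 13, 14 and 16 of `df1`.

```r
# Detections of individual 17 taken from df1 rows 12, 13, 14 and 16
t0    <- as.POSIXct("2016-08-01 17:27:16", tz = "UTC")   # row 12, Rec V6 (kept)
later <- as.POSIXct(c("2016-08-01 17:27:22",             # row 13, Rec V6
                      "2016-08-01 17:28:01",             # row 14, Rec V9
                      "2016-08-01 17:28:08"),            # row 16, Rec V4
                    tz = "UTC")
rec   <- c("V6", "V9", "V4")

# Seconds elapsed since the kept detection: 6, 45 and 52
gap <- as.numeric(difftime(later, t0, units = "secs"))

# Inside the 55 s window, and not the same receiver as the kept row
counted <- gap > 0 & gap <= 55 & rec != "V6"

sum(counted)                         # Num_Rec   -> 2
paste(rec[counted], collapse = " ")  # Which_Rec -> "V9 V4"
```

All three later detections fall inside the window, but the `V6` one is dropped because it comes from the same receiver as the kept row.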
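I would like to know how to do this with a large data frame. I guess it is possible with the `dplyr` package, but I don't know how.
I tried the following, as proposed by a Stack Overflow colleague in an answer: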
library(tidyverse)
df <- data.frame(DateTime=c("2016-08-01 12:04:07","2016-08-01 12:06:07","2016-08-01 12:06:58","2016-08-01 13:12:12","2016-08-01 14:04:07","2016-08-01 13:12:45","2016-08-01 15:04:07","2016-08-01 17:13:16","2016-08-01 17:21:16","2016-08-01 17:21:34","2016-08-01 17:23:42","2016-08-01 17:27:16","2016-08-01 17:27:22","2016-08-01 17:28:01","2016-08-01 17:29:28","2016-08-01 17:28:08"),Rec=c("V6", "V7", "V6", "V6", "V7", "V7", "V6", "V7", "V7","V7","V6","V6", "V6", "V9", "V7", "V4"),Ind=c(16, 17, 16, 16, 17, 16, 17, 16, 17, 16, 16, 17, 17, 17, 16, 17))%>%
mutate(Rec = as.character(Rec),
DateTime = as.POSIXct(as.character(DateTime))) %>%
as_tibble()
First I define a delete_flag by checking if the same individual has been caught more than once within 55 seconds. Then I filter the data accordingly.
Next I use `pmap` to get `Num_Rec` and `Which_Rec`:
df %>%
mutate(delete_flag = map2_lgl(DateTime, Ind, ~filter(df, DateTime < .x, DateTime >= .x - 55,
Ind == .y) %>% nrow %>% as.logical())) %>%
filter(!delete_flag) %>%
select(-delete_flag) %>%
mutate(x = pmap(list(DateTime, Rec, Ind), ~filter(df, DateTime > ..1, DateTime <= ..1 +55,
Rec != ..2, Ind == ..3) %>%
summarise(Num_Rec = n(),
Which_Rec = paste0(Rec, collapse = " ")))) %>%
unnest()
DateTime Rec Ind Num_Rec Which_Rec
<dttm> <chr> <dbl> <int> <chr>
1 2016-08-01 12:04:07 V6 16 0 ""
2 2016-08-01 12:06:07 V7 17 0 ""
3 2016-08-01 12:06:58 V6 16 0 ""
4 2016-08-01 13:12:12 V6 16 1 V7
5 2016-08-01 14:04:07 V7 17 0 ""
6 2016-08-01 15:04:07 V6 17 0 ""
7 2016-08-01 17:13:16 V7 16 0 ""
8 2016-08-01 17:21:16 V7 17 0 ""
9 2016-08-01 17:21:34 V7 16 0 ""
10 2016-08-01 17:23:42 V6 16 0 ""
11 2016-08-01 17:27:16 V6 17 2 V9 V4
12 2016-08-01 17:29:28 V7 16 0 ""
However, when I run the code above, what I get is different from what he got, and I don't know why:
# A tibble: 12 x 5
DateTime Rec Ind Num_Rec Which_Rec
<dttm> <chr> <dbl> <int> <chr>
1 2016-08-01 12:04:07 V6 16 12 ""
2 2016-08-01 12:06:07 V7 17 12 ""
3 2016-08-01 12:06:58 V6 16 12 ""
4 2016-08-01 13:12:12 V6 16 12 V7
5 2016-08-01 14:04:07 V7 17 12 ""
6 2016-08-01 15:04:07 V6 17 12 ""
7 2016-08-01 17:13:16 V7 16 12 ""
8 2016-08-01 17:21:16 V7 17 12 ""
9 2016-08-01 17:21:34 V7 16 12 ""
10 2016-08-01 17:23:42 V6 16 12 ""
11 2016-08-01 17:27:16 V6 17 12 V9 V4
12 2016-08-01 17:29:28 V7 16 12 ""
Here is a possible solution using `map2` and `pmap` from the `purrr` package.
First, this is the data I am working with:
library(tidyverse)
df <- data.frame(DateTime=c("2016-08-01 12:04:07","2016-08-01 12:06:07","2016-08-01 12:06:58","2016-08-01 13:12:12","2016-08-01 14:04:07","2016-08-01 13:12:45","2016-08-01 15:04:07","2016-08-01 17:13:16","2016-08-01 17:21:16","2016-08-01 17:21:34","2016-08-01 17:23:42","2016-08-01 17:27:16","2016-08-01 17:27:22","2016-08-01 17:28:01","2016-08-01 17:29:28","2016-08-01 17:28:08"),Rec=c("V6", "V7", "V6", "V6", "V7", "V7", "V6", "V7", "V7","V7","V6","V6", "V6", "V9", "V7", "V4"),Ind=c(16, 17, 16, 16, 17, 16, 17, 16, 17, 16, 16, 17, 17, 17, 16, 17))%>%
mutate(Rec = as.character(Rec),
DateTime = as.POSIXct(as.character(DateTime))) %>%
as_tibble()
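First I define a `delete_flag` by checking whether the same individual has been caught more than once within 55 seconds. Then I filter the data accordingly.
Next I use `pmap` to get `Num_Rec` and `Which_Rec`: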
df %>%
mutate(delete_flag = map2_lgl(DateTime, Ind, ~filter(df, DateTime < .x, DateTime >= .x - 55,
Ind == .y) %>% nrow %>% as.logical())) %>%
filter(!delete_flag) %>%
select(-delete_flag) %>%
mutate(x = pmap(list(DateTime, Rec, Ind), ~filter(df, DateTime > ..1, DateTime <= ..1 +55,
Rec != ..2, Ind == ..3) %>%
summarise(Num_Rec = n(),
Which_Rec = paste0(Rec, collapse = " ")))) %>%
unnest()
DateTime Rec Ind Num_Rec Which_Rec
<dttm> <chr> <dbl> <int> <chr>
1 2016-08-01 12:04:07 V6 16 0 ""
2 2016-08-01 12:06:07 V7 17 0 ""
3 2016-08-01 12:06:58 V6 16 0 ""
4 2016-08-01 13:12:12 V6 16 1 V7
5 2016-08-01 14:04:07 V7 17 0 ""
6 2016-08-01 15:04:07 V6 17 0 ""
7 2016-08-01 17:13:16 V7 16 0 ""
8 2016-08-01 17:21:16 V7 17 0 ""
9 2016-08-01 17:21:34 V7 16 0 ""
10 2016-08-01 17:23:42 V6 16 0 ""
11 2016-08-01 17:27:16 V6 17 2 V9 V4
12 2016-08-01 17:29:28 V7 16 0 ""
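The `Num_Rec` value of 12 on every row reported in the question equals the number of rows left after filtering. One possible cause (an assumption, not confirmed in this thread) is that some dplyr versions resolve `n()` inside the nested `summarise()` against the outer `mutate()` data. The following is a minimal base R sketch of the same two steps that does not rely on `n()` and returns `NA` instead of `""` when no other receiver is found, as in the desired `Result`. The names `window`, `other_rec` and `result` are chosen for this illustration only.

```r
df <- data.frame(
  DateTime = as.POSIXct(c(
    "2016-08-01 12:04:07", "2016-08-01 12:06:07", "2016-08-01 12:06:58",
    "2016-08-01 13:12:12", "2016-08-01 14:04:07", "2016-08-01 13:12:45",
    "2016-08-01 15:04:07", "2016-08-01 17:13:16", "2016-08-01 17:21:16",
    "2016-08-01 17:21:34", "2016-08-01 17:23:42", "2016-08-01 17:27:16",
    "2016-08-01 17:27:22", "2016-08-01 17:28:01", "2016-08-01 17:29:28",
    "2016-08-01 17:28:08"), tz = "UTC"),
  Rec = c("V6", "V7", "V6", "V6", "V7", "V7", "V6", "V7",
          "V7", "V7", "V6", "V6", "V6", "V9", "V7", "V4"),
  Ind = c(16, 17, 16, 16, 17, 16, 17, 16, 17, 16, 16, 17, 17, 17, 16, 17),
  stringsAsFactors = FALSE
)

window <- 55  # interval length in seconds

# Step 1: flag rows where the same individual was detected in the previous 55 s
delete_flag <- vapply(seq_len(nrow(df)), function(i) {
  any(df$Ind == df$Ind[i] &
      df$DateTime <  df$DateTime[i] &
      df$DateTime >= df$DateTime[i] - window)
}, logical(1))

result <- df[!delete_flag, ]

# Step 2: for each kept row, collect the other receivers that detected the
# same individual in the following 55 s (searching the full, unfiltered df)
other_rec <- lapply(seq_len(nrow(result)), function(i) {
  df$Rec[df$Ind == result$Ind[i] &
         df$Rec != result$Rec[i] &
         df$DateTime >  result$DateTime[i] &
         df$DateTime <= result$DateTime[i] + window]
})

result$Num_Rec   <- lengths(other_rec)
result$Which_Rec <- vapply(other_rec, function(r) {
  if (length(r) == 0) NA_character_ else paste(r, collapse = " ")
}, character(1))
rownames(result) <- NULL

result
```

Like the `purrr` version, this compares every row against the whole data frame, so the cost grows quadratically with the number of rows. For a large data frame it would probably help to split the data by `Ind` first and apply the same logic to each piece.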