How to record the first instance of an observation by group?
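I have a dataset of library loan data. It records the exact date and time at which an item was checked out. Over the time span of the data, new items enter the dataset and are checked out for the first time. From time to time these items change their classification, and that is what I am interested in. What I want to do is take my dataset and create a new one that records the classification of every item at (the start of) every week.
Data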
library(dplyr)
library(tibble)  # add_column()
library(purrr)   # map2()

# Simulate data
set.seed(1)
item <- rep(1:10, each = 10)
timedate <- as.POSIXct("2010-01-01 00:00:00") + runif(n = 100, min = 0, max = 31*24*60*60)
classification <- sample(c(NA, letters[1:4]), 100, replace = TRUE)
my_df <- tibble(item, timedate, classification) %>%
  # Take a random subset of 6 to 10 rows per item
  group_nest(item, keep = TRUE) %>%
  add_column(mysamples = sample(6:10, 10, replace = TRUE)) %>%
  mutate(sampled = map2(data, mysamples, ~ sample_n(.x, .y))) %>%
  pull(sampled) %>%
  bind_rows() %>%
  arrange(item, timedate)
# This is what my simulated data looks like
my_df
item timedate classification
1 1 2010-01-02 21:58:08 a
2 1 2010-01-07 06:03:04 d
3 1 2010-01-12 12:51:36 c
4 1 2010-01-20 12:03:39 a
5 1 2010-01-21 11:38:00 b
6 1 2010-01-28 20:24:06 <NA>
7 1 2010-01-29 03:42:23 d
8 1 2010-01-30 06:50:18 <NA>
9 2 2010-01-06 11:21:29 a
10 2 2010-01-07 09:14:42 b
11 2 2010-01-12 18:44:46 b
12 2 2010-01-12 21:46:23 b
13 2 2010-01-16 10:17:17 a
14 2 2010-01-22 07:08:41 c
15 2 2010-01-23 05:54:29 a
Objective
Taking the first 5 weeks as an example:
week <- seq(as.Date("2010-01-01"), as.Date("2010-02-01"), 7)
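These are the steps I think are needed to get what I want:
- Pick a one-week interval (starting with the first)
- For every item, do the following
- If the item was checked out during the interval, sort its checkouts by date and return the classification of the first checkout. Because the dataset already records some classifications as NA, this may be a letter or NA.
- If the item was not checked out during the week interval, check whether it was checked out earlier. If so, return the last available classification
- If the item was not checked out in any earlier interval, return "never checked out"
- Repeat for all week intervals
The new dataset should contain length(week)*length(unique(item)) rows in total.
It would look like this: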
item week classification
1 1 a
1 2 c
1 3 a
1 4 NA
1 5 d
... # looking for the first case where there is no observation in week 1
8 1 "never checked out"
8 2 b
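To make the target shape concrete, here is a minimal sketch of the empty item-by-week skeleton that the steps above would fill in. It uses only base R; the names skeleton, week_id and week_start are made up for illustration, and the classification column is deliberately left out.

```r
# Inputs as defined in the question
item <- rep(1:10, each = 10)
week <- seq(as.Date("2010-01-01"), as.Date("2010-02-01"), 7)

# One row per item per week; `skeleton`, `week_id` and `week_start`
# are illustrative names, not part of the original data
skeleton <- expand.grid(item = unique(item), week_id = seq_along(week))
skeleton$week_start <- week[skeleton$week_id]
skeleton <- skeleton[order(skeleton$item, skeleton$week_id), ]

# The row count matches length(week) * length(unique(item))
nrow(skeleton) == length(week) * length(unique(item))
```

With the week vector above there are five week starts, so the skeleton has 5 * 10 rows.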
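What I've done so far
The closest I have come so far is using the interval function from the lubridate package. I have created some intervals with the code below. What remains is to check whether an item has a new classification in any given interval: if it does, use it; if not, use the previous one; and if there is no previous one, code it as "never checked out".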
library(lubridate)
library(purrr)  # map2()
# One interval per week: from each week start up to the day before the next one
myintervals <- map2(head(week, -1), tail(week, -1), function(x, y) interval(x, y - 1))
# Here I just keep the earliest checkout per item and calendar day. Not sure if I'm on the right track.
my_df %>%
  group_by(item, as.Date(timedate)) %>%
  filter(timedate == min(timedate))
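As an aside, one possible way to decide which week a checkout falls into without building interval objects is base R's findInterval(). The sketch below is only an illustration: it uses four timestamps copied from the printed data for item 1, assumes they are in UTC, and the name week_id is made up.

```r
# Week starts as in the question
week <- seq(as.Date("2010-01-01"), as.Date("2010-02-01"), 7)

# Four checkout times of item 1 from the printed data (UTC is an assumption)
timedate <- as.POSIXct(c("2010-01-02 21:58:08", "2010-01-07 06:03:04",
                         "2010-01-12 12:51:36", "2010-01-29 03:42:23"),
                       tz = "UTC")

# For each date, findInterval() gives the index of the last week start
# that is <= that date, i.e. a 1-based week number
week_id <- findInterval(as.numeric(as.Date(timedate, tz = "UTC")),
                        as.numeric(week))
week_id
```

The first two checkouts fall before 2010-01-08 and so land in week 1, the third in week 2, and the last one in week 5.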
Note that the referenced post does not use my simulated data, so it does not apply to my case.
We can do the bind_rows after creating the 'week' column
out <- tibble(item, timedate, classification) %>%
  group_nest(item, keep = TRUE) %>%
  add_column(mysamples = sample(6:10, 10, replace = TRUE)) %>%
  mutate(sampled = map2(data, mysamples, ~ sample_n(.x, .y))) %>%
  pull(sampled) %>%
  map_dfr(~ .x %>%
            slice(seq_len(length(week))) %>%
            mutate(classification = replace(classification,
                     cumsum(!is.na(classification)) == 0, 'never checked out'),
                   week = week))
tail(out, 15)
# A tibble: 15 x 4
item timedate classification week
<int> <dttm> <chr> <date>
1 8 2010-01-25 02:19:35 never checked out 2010-01-01
2 8 2010-01-11 12:16:12 never checked out 2010-01-08
3 8 2010-01-15 18:24:19 b 2010-01-15
4 8 2010-01-30 18:41:59 a 2010-01-22
5 8 2010-01-28 15:47:44 b 2010-01-29
6 9 2010-01-11 02:03:43 c 2010-01-01
7 9 2010-01-13 09:35:44 a 2010-01-08
8 9 2010-01-23 02:06:39 a 2010-01-15
9 9 2010-01-14 11:23:11 c 2010-01-22
10 9 2010-01-23 01:04:27 d 2010-01-29
11 10 2010-01-13 17:06:09 c 2010-01-01
12 10 2010-01-25 03:30:45 b 2010-01-08
13 10 2010-01-28 03:56:39 <NA> 2010-01-15
14 10 2010-01-02 19:50:49 c 2010-01-22
15 10 2010-01-15 02:43:27 <NA> 2010-01-29
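The replace() / cumsum() idiom in the code above is easier to see on a small vector. This is a toy illustration with a made-up vector x, not the thread's data: cumsum(!is.na(x)) stays at 0 until the first non-missing value, so only the leading NAs are relabelled.

```r
# Made-up example vector: two leading NAs, then values with an NA in between
x <- c(NA, NA, "b", NA, "a")

# TRUE only for positions before the first non-missing value
leading_na <- cumsum(!is.na(x)) == 0

# Relabel the leading NAs; the later NA is left untouched
relabelled <- replace(x, leading_na, "never checked out")
relabelled
```

Note that an NA appearing after the first real value is kept as NA, which matches the rows for item 10 in the output above.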
library(tidyverse)
library(lubridate)
# using only my_df from above - NOT the week object
my_df %>%
  mutate(week = week(timedate),   # generate week variable
         year = year(timedate),   # not called for, but probably useful
         classification = ifelse(is.na(classification), "NA", classification)  # convert NA to the string "NA" so the fill in step 4 does not overwrite it
  ) %>%
  group_by(item, week) %>%        # steps 1 and 2
  arrange(timedate) %>%           # step 3
  summarize(new_class = first(classification)) %>%  # step 3
  ungroup() %>%
  complete(item, week) %>%        # expand the df to all item-weeks
  group_by(item) %>% arrange(week) %>%  # prep for step 4
  fill(new_class, .direction = "down") %>%  # step 4 - fill downwards
  mutate(new_class = replace_na(new_class, "never checked out")) %>%  # step 5
  ungroup() %>%
  arrange(item, week)             # match display
# for multi-year data, anytime you group by week also group by year
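lubridate's week() counts seven-day blocks from January 1 of each year, which happens to line up with the question's week vector for January 2010. If the week starts should instead come from an arbitrary vector of dates, a variant along the following lines might work. It is only a sketch on a tiny made-up data set (toy), with UTC timestamps assumed and week_id as an illustrative column name.

```r
library(dplyr)
library(tidyr)

week <- seq(as.Date("2010-01-01"), as.Date("2010-02-01"), 7)

# Tiny made-up data set, not the thread's simulated data
toy <- tibble(
  item = c(1, 1, 1, 2, 2),
  timedate = as.POSIXct(c("2010-01-02 10:00:00", "2010-01-03 09:00:00",
                          "2010-01-20 12:00:00", "2010-01-10 08:00:00",
                          "2010-01-11 08:00:00"), tz = "UTC"),
  classification = c("a", "b", NA, "c", "d")
)

weekly <- toy %>%
  mutate(
    # index of the week start each checkout belongs to
    week_id = findInterval(as.numeric(as.Date(timedate, tz = "UTC")),
                           as.numeric(week)),
    # keep a recorded NA distinguishable from "no checkout this week"
    classification = if_else(is.na(classification), "NA", classification)
  ) %>%
  arrange(item, timedate) %>%
  group_by(item, week_id) %>%
  summarise(classification = first(classification), .groups = "drop") %>%
  complete(item, week_id = seq_along(week)) %>%   # all item-week pairs
  arrange(item, week_id) %>%
  group_by(item) %>%
  fill(classification, .direction = "down") %>%   # carry last class forward
  ungroup() %>%
  mutate(classification = replace_na(classification, "never checked out"))

weekly
```

In this toy data, item 2 has no checkout in week 1 and so starts as "never checked out", while item 1 carries its recorded "NA" from week 3 forward into weeks 4 and 5.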