How to record the first instance of an observation by group?

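I have a dataset containing library checkout data. It records the exact date and time at which each item was checked out. Over the time span of the data, new items enter the dataset and are checked out for the first time. From time to time these items change their classification, and that is what I am interested in. So what I want to do is take my dataset and create a new one that records the classification of every item at (the start of) each week.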

Data

library(tidyverse)  # group_nest() is dplyr, add_column() is tibble, map2() is purrr

# Simulate data
set.seed(1)
item <- rep(1:10, each = 10)
timedate <- as.POSIXct("2010-01-01 00:00:00") + runif(n = 100, min = 0, max = 31*24*60*60)
classification <- sample(c(NA, letters[1:4]), 100, replace = TRUE)
my_df <- tibble(item, timedate, classification) %>%
  # Taking a random subset by group
  group_nest(item, keep = TRUE) %>%
  add_column(mysamples = sample(6:10, 10, replace = TRUE)) %>%
  mutate(sampled = map2(data, mysamples, ~ sample_n(.x, .y))) %>%
  pull(sampled) %>%
  bind_rows() %>%
  arrange(item, timedate)

# This is what my simulated data looks like
my_df

   item            timedate classification
1     1 2010-01-02 21:58:08              a
2     1 2010-01-07 06:03:04              d
3     1 2010-01-12 12:51:36              c
4     1 2010-01-20 12:03:39              a
5     1 2010-01-21 11:38:00              b
6     1 2010-01-28 20:24:06           <NA>
7     1 2010-01-29 03:42:23              d
8     1 2010-01-30 06:50:18           <NA>
9     2 2010-01-06 11:21:29              a
10    2 2010-01-07 09:14:42              b
11    2 2010-01-12 18:44:46              b
12    2 2010-01-12 21:46:23              b
13    2 2010-01-16 10:17:17              a
14    2 2010-01-22 07:08:41              c
15    2 2010-01-23 05:54:29              a

Objective

Taking the first 5 weeks as an example:

week <- seq(as.Date("2010-01-01"), as.Date("2010-02-01"), 7)
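One possible way to map a checkout timestamp onto these week starts is base R's `findInterval()`, which returns the index of the last break that is less than or equal to each value. This is only a minimal sketch: the vector `stamps` is hypothetical and not part of the data above, and UTC is assumed so that the date conversion is unambiguous.

```r
# Week starts, as defined above
week <- seq(as.Date("2010-01-01"), as.Date("2010-02-01"), 7)

# Hypothetical timestamps, for illustration only (UTC assumed)
stamps <- as.POSIXct(c("2010-01-02 21:58:08", "2010-01-07 06:03:04",
                       "2010-01-12 12:51:36", "2010-01-30 06:50:18"),
                     tz = "UTC")

# Index of the latest week start that falls on or before each checkout date
week_id <- findInterval(as.numeric(as.Date(stamps, tz = "UTC")),
                        as.numeric(week))

week_id        # week number of each checkout
week[week_id]  # corresponding week start dates
```

Because each bin runs from one week start up to (but not including) the next, a checkout on 2010-01-07 still counts toward the first week.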

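These are the steps I think need to be carried out to get what I want:

  1. Select a week interval (starting with the first one)
  2. Do the following for all items
  3. If a given item was checked out during the interval, sort its checkouts by date and return the classification of the first checkout. Since the dataset already records some classifications as NA, this value may be a letter or NA.
  4. If the item was not checked out during the week interval, check whether it was checked out earlier. If so, return the last available classification
  5. If the item was not checked out in any earlier interval, return "never checked out"
  6. Repeat for all week intervals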

The new dataset should contain length(week)*length(unique(item)) rows in total.
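The steps and the expected row count can be sketched end to end on a tiny example. Everything here is an assumption made for illustration: the tibble `toy`, the names `week_starts`, `week_id`, `observed` and `last_obs`, the use of UTC, and the choice to carry a recorded NA forward as NA rather than skipping it.

```r
library(dplyr)
library(tidyr)

week_starts <- seq(as.Date("2010-01-01"), as.Date("2010-02-01"), 7)

# Hypothetical toy data: item 1 is checked out in weeks 1 and 3, item 2 only in week 2
toy <- tibble(
  item = c(1, 1, 1, 2),
  timedate = as.POSIXct(c("2010-01-02 10:00:00", "2010-01-03 09:00:00",
                          "2010-01-16 12:00:00", "2010-01-09 08:00:00"),
                        tz = "UTC"),
  classification = c("a", "b", NA, "c")
)

result <- toy %>%
  # Step 1: assign each checkout to a week interval
  mutate(week_id = findInterval(as.numeric(as.Date(timedate, tz = "UTC")),
                                as.numeric(week_starts))) %>%
  # Step 3: first checkout per item and week (may be a recorded NA)
  arrange(item, timedate) %>%
  group_by(item, week_id) %>%
  summarise(classification = first(classification), .groups = "drop") %>%
  mutate(observed = TRUE) %>%
  # Steps 2 and 6: one row for every item and every week
  complete(item, week_id = seq_along(week_starts),
           fill = list(observed = FALSE)) %>%
  group_by(item) %>%
  arrange(week_id, .by_group = TRUE) %>%
  mutate(
    # Row of the most recent week with a checkout, 0 if there is none yet
    last_obs = cummax(if_else(observed, row_number(), 0L)),
    # Step 4: carry the last value forward; step 5: label items never seen
    classification = if_else(last_obs == 0L, "never checked out",
                             classification[pmax(last_obs, 1L)])
  ) %>%
  ungroup() %>%
  select(item, week_id, classification)

nrow(result) == length(week_starts) * n_distinct(toy$item)
```

The `observed` flag is what keeps a recorded NA apart from a week without any checkout, so the NA does not have to be converted to a string first.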

The new dataset would look like this:

item                week classification
   1                   1              a
   1                   2              c
   1                   3              a
   1                   4              NA
   1                   5              d
   ... # looking for the first case where there is no observation in week 1
   8                   1              "never checked out"
   8                   2              b

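What I have done so far

The closest I have come so far is with the interval function from the lubridate package. I have created some intervals with the code below. What remains is to check whether an item has a new classification in any particular interval: if it does, use it; if it does not, use the old one; and if no old one exists, code it as "never checked out".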

library(lubridate)

myintervals <- map2(head(week, -1), tail(week, -1), function(x,y ) interval(x,y-1))

### Here I just filter out those observations which have multiples of the same date. Not sure if I'm on the right track.

my_df %>%
  group_by(item, as.Date(timedate)) %>% 
  filter(timedate == min(timedate))
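One thing that may be worth checking with intervals built as `interval(x, y - 1)`: as far as I understand lubridate, a Date endpoint is treated as midnight at the start of that day, so a checkout later on the last day of a week would fall outside every interval. The sketch below illustrates this with a single hypothetical timestamp (`stamp`, UTC assumed) and compares it with a half-open test; the names `iv`, `wk_start` and `wk_next` are made up for the example.

```r
library(lubridate)

# First weekly interval, built the same way as in myintervals
iv <- interval(as.Date("2010-01-01"), as.Date("2010-01-08") - 1)

# Hypothetical checkout on the last day of week 1 (UTC assumed)
stamp <- as.POSIXct("2010-01-07 06:03:04", tz = "UTC")

# Membership test against the interval, which ends at 2010-01-07 00:00:00
in_closed <- stamp %within% iv

# Half-open alternative: week start <= stamp < next week start
wk_start <- as.POSIXct("2010-01-01", tz = "UTC")
wk_next  <- as.POSIXct("2010-01-08", tz = "UTC")
in_half_open <- stamp >= wk_start & stamp < wk_next

c(in_closed = in_closed, in_half_open = in_half_open)
```

If the two results differ on your data, comparing against consecutive week starts avoids the gap without having to subtract a day.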

Please note that [...] does not use my simulated data and therefore does not apply to my case.

We can call bind_rows after creating the 'week' column

out <- tibble(item, timedate, classification) %>%
  group_nest(item, keep= TRUE)  %>% 
  add_column(mysamples = sample(6:10, 10, replace = TRUE)) %>% 
  mutate(sampled  = map2(data , mysamples, ~ sample_n(.x, .y))) %>%
  pull(sampled) %>%
  map_dfr(~ .x %>% 
        slice(seq_len(length(week))) %>%
        mutate(classification = replace(classification, 
         cumsum(!is.na(classification)) == 0, 'never checked out'),
        week = week)) 



tail(out, 15)
# A tibble: 15 x 4
    item timedate            classification    week      
   <int> <dttm>              <chr>             <date>    
 1     8 2010-01-25 02:19:35 never checked out 2010-01-01
 2     8 2010-01-11 12:16:12 never checked out 2010-01-08
 3     8 2010-01-15 18:24:19 b                 2010-01-15
 4     8 2010-01-30 18:41:59 a                 2010-01-22
 5     8 2010-01-28 15:47:44 b                 2010-01-29
 6     9 2010-01-11 02:03:43 c                 2010-01-01
 7     9 2010-01-13 09:35:44 a                 2010-01-08
 8     9 2010-01-23 02:06:39 a                 2010-01-15
 9     9 2010-01-14 11:23:11 c                 2010-01-22
10     9 2010-01-23 01:04:27 d                 2010-01-29
11    10 2010-01-13 17:06:09 c                 2010-01-01
12    10 2010-01-25 03:30:45 b                 2010-01-08
13    10 2010-01-28 03:56:39 <NA>              2010-01-15
14    10 2010-01-02 19:50:49 c                 2010-01-22
15    10 2010-01-15 02:43:27 <NA>              2010-01-29

library(tidyverse)
library(lubridate)

#using only my_df from above - NOT the week object

my_df %>%
  mutate(week = week(timedate), #generate week variable
         year = year(timedate), #not called for, but probably useful,
         classification = ifelse(is.na(classification), "NA", classification) #convert NA to character NA, to handle step 4
  ) %>%
  group_by(item, week) %>% #step 2, 1
  arrange(timedate) %>% #step 3
  summarize(new_class = first(classification)) %>% #step 3
  ungroup() %>% #cleanse
  complete(item, week) %>% #expand the df for all item-weeks
  group_by(item) %>% arrange(week) %>% #prep for 4
  fill(new_class, .direction = "down") %>% #step 4 - fill downwards
  mutate(new_class = replace_na(new_class, "never checked out")) %>% #step 5
  ungroup() %>% #cleanse
  arrange(item, week)  #match display 
#for multi-year data, anytime you group by week also group by year
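The remark about multi-year data can be illustrated with two dates on either side of a year boundary. This is a sketch, not part of the answer above: the vector `d`, the `origin` date and the name `week_index` are hypothetical, and counting 7-day blocks from a fixed origin is just one alternative to grouping by year and week together.

```r
library(lubridate)

# Hypothetical dates straddling a year boundary, for illustration only
d <- as.Date(c("2010-12-30", "2011-01-03"))

# week() restarts at 1 every January, so on its own it mixes years together
week(d)

# Option 1: pair the week number with the year
year_week <- paste(year(d), week(d), sep = "-W")

# Option 2: count 7-day blocks from a fixed origin, which never resets
origin <- as.Date("2010-01-01")
week_index <- as.integer(d - origin) %/% 7L + 1L

year_week
week_index
```

With the second option the key keeps increasing across years, so `complete()` and `fill()` can still be applied per item without an extra year column.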