R: How can I find the most recent row with a certain value?

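Good evening,

I have a very large dataset in R, and I'm trying to find the best way to loop through it to solve a problem. Think of the data as employees' historical working hours. It looks like: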

rawTable:

Department      Name      Date         Hours

Engineering     Mary      2021-01-01   8
Engineering     Mary      2021-01-02   8
Engineering     Mary      2021-01-03   0
Engineering     Mary      2021-01-04   6
Sales           Barry     2021-01-01   0
Sales           Barry     2021-01-02   12
Sales           Barry     2021-01-03   12
Sales           Barry     2021-01-04   12    

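I have about 3,200 people on my list, with one row for every day of the year, so the table is obviously very large.

I need to add two columns to the table:

The first is LDO, which shows (for each day) the last day they had off.

The second is WSH, which shows how many hours the person has worked since their last day off. It would look like: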

rawTable:

Department      Name      Date         Hours  LDO          WSH

Engineering     Mary      2021-01-01   8      2020-12-31   8
Engineering     Mary      2021-01-02   8      2020-12-31   16
Engineering     Mary      2021-01-03   0      2021-01-03   0
Engineering     Mary      2021-01-04   6      2021-01-03   6
Sales           Barry     2021-01-01   0      2021-01-01   0
Sales           Barry     2021-01-02   12     2021-01-01   12
Sales           Barry     2021-01-03   12     2021-01-01   24
Sales           Barry     2021-01-04   12     2021-01-01   36

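I've tried using a for loop to apply the logic row by row. For each row, if Hours equals zero, then LDO = Date and WSH = 0. If not, then LDO = LDO from the previous row and WSH = WSH from the previous row + Hours. With a set this size, that takes forever and a half to run.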
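For reference, the row-by-row logic described above can be sketched like this. It is a reconstruction on the toy data, not the original code; the name `raw`, the assumption that rows are sorted by Name and then Date, and the handling of a person's first row (using the day before as LDO, as in the expected output) are all illustrative.

```r
# Toy data copied from the example table above
raw <- data.frame(
  Name  = rep(c("Mary", "Barry"), each = 4),
  Date  = as.Date(rep(c("2021-01-01", "2021-01-02",
                        "2021-01-03", "2021-01-04"), 2)),
  Hours = c(8L, 8L, 0L, 6L, 0L, 12L, 12L, 12L)
)
raw$LDO <- as.Date(NA)
raw$WSH <- NA_integer_

for (i in seq_len(nrow(raw))) {
  # Assumption: rows are sorted by Name, then Date
  first_row <- i == 1 || raw$Name[i] != raw$Name[i - 1]
  if (raw$Hours[i] == 0) {
    # A day off resets both columns
    raw$LDO[i] <- raw$Date[i]
    raw$WSH[i] <- 0L
  } else if (first_row) {
    # Assumption: no earlier day off is known, so use the day before
    raw$LDO[i] <- raw$Date[i] - 1
    raw$WSH[i] <- raw$Hours[i]
  } else {
    # Otherwise carry LDO forward and keep adding hours
    raw$LDO[i] <- raw$LDO[i - 1]
    raw$WSH[i] <- raw$WSH[i - 1] + raw$Hours[i]
  }
}
```

Each iteration assigns into the data frame one element at a time, which is a likely reason this gets slow at around a million rows.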

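Next I created a function that, given a row, uses a copy of the big list and a `which` statement to tell me the row number of the last date, up to and including that row, on which the person worked 0 hours. This also took forever and a half, and I haven't even gotten to the WSH part yet. It looks like: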

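# Lookup copy of the full table
rawLU <- rawTable

# For one row x, return the row number of that person's latest zero-hour day
# on or before x's date (0 if there is none)
LDO = function(x) {
  max(c(0, which((rawLU$Name == x["Name"]) &
                   (rawLU$Hours == 0) & (rawLU$Date <= x["Date"])
  )))
}

# apply() passes each row as a character vector, so Date is compared as text
LastOff<-apply(rawTable,1,LDO)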

I know there's an easier way, but I also know I can't seem to figure it out.

Can anyone help? Thanks in advance!

Mike

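Here is a possible solution with dplyr:

Take the Date value where Hours == 0, then use fill to carry the previous non-working date down to the other rows. WSH can be calculated with cumsum.
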
library(dplyr)
library(tidyr)

rawTable %>%
  mutate(Date = as.Date(Date)) %>%
  group_by(Department, Name) %>%
  mutate(LDO = if_else(Hours == 0, Date, as.Date(NA))) %>%
  fill(LDO) %>%
  mutate(LDO = if_else(is.na(LDO), min(Date) - 1, LDO)) %>%
  group_by(LDO, .add = TRUE) %>%
  mutate(WSH = cumsum(Hours)) %>%
  ungroup

#  Department  Name  Date       Hours LDO          WSH
#  <chr>       <chr> <date>     <int> <date>     <int>
#1 Engineering Mary  2021-01-01     8 2020-12-31     8
#2 Engineering Mary  2021-01-02     8 2020-12-31    16
#3 Engineering Mary  2021-01-03     0 2021-01-03     0
#4 Engineering Mary  2021-01-04     6 2021-01-03     6
#5 Sales       Barry 2021-01-01     0 2021-01-01     0
#6 Sales       Barry 2021-01-02    12 2021-01-01    12
#7 Sales       Barry 2021-01-03    12 2021-01-01    24
#8 Sales       Barry 2021-01-04    12 2021-01-01    36
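To see why the pipeline above produces those numbers, here is a minimal base R sketch of the same carry-forward-then-cumsum idea for a single person. It is not the answer's code: `cummax` on row indices stands in for `tidyr::fill`, and the names `dates`, `hours`, `idx`, `ldo` and `wsh` are illustrative. It assumes the rows are already in date order.

```r
# One person's records in date order (Mary's rows from the example)
dates <- as.Date(c("2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"))
hours <- c(8L, 8L, 0L, 6L)

# Row index of the most recent day off at or before each row (0 = none yet).
# cummax carries the last seen index forward, much like fill() does for LDO.
idx <- cummax(ifelse(hours == 0, seq_along(hours), 0L))

# Before the first day off, fall back to the day before the first record,
# mirroring the min(Date) - 1 step in the pipeline.
ldo <- c(min(dates) - 1, dates)[idx + 1]

# Running total of hours that restarts whenever the last day off changes
wsh <- ave(hours, idx, FUN = cumsum)

data.frame(Date = dates, Hours = hours, LDO = ldo, WSH = wsh)
```

For many people at once, the same steps would have to run per person, which is what `group_by(Department, Name)` takes care of in the dplyr version.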

Data

rawTable <- structure(list(Department = c("Engineering", "Engineering", "Engineering", 
"Engineering", "Sales", "Sales", "Sales", "Sales"), Name = c("Mary", 
"Mary", "Mary", "Mary", "Barry", "Barry", "Barry", "Barry"), 
    Date = c("2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04", 
    "2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"), 
    Hours = c(8L, 8L, 0L, 6L, 0L, 12L, 12L, 12L)), class = "data.frame", row.names = c(NA, -8L))
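
Another option is to group on a running count of days off, cumsum(Hours == 0), so that each stretch since a day off becomes its own group:

rawTable %>%
   group_by(Department, Name, grp = cumsum(Hours == 0)) %>%
   mutate(Date = as.Date(Date),
      LDO = first(Date) - (first(Hours) > 0),
      WSH = cumsum(Hours))

# A tibble: 8 x 7
# Groups:   Department, Name, grp [3]
  Department  Name  Date       Hours   grp LDO          WSH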
  <chr>       <chr> <date>     <int> <int> <date>     <int>
1 Engineering Mary  2021-01-01     8     0 2020-12-31     8
2 Engineering Mary  2021-01-02     8     0 2020-12-31    16
3 Engineering Mary  2021-01-03     0     1 2021-01-03     0
4 Engineering Mary  2021-01-04     6     1 2021-01-03     6
5 Sales       Barry 2021-01-01     0     2 2021-01-01     0
6 Sales       Barry 2021-01-02    12     2 2021-01-01    12
7 Sales       Barry 2021-01-03    12     2 2021-01-01    24
8 Sales       Barry 2021-01-04    12     2 2021-01-01    36
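The grouping trick is easier to follow on plain vectors. The sketch below, in base R with illustrative names (`hours`, `name`, `grp`, `total`), shows how `cumsum(Hours == 0)` numbers the stretches and how a per-stretch running total follows from it. It assumes rows are sorted by person and then by date.

```r
# Hours from the example: Mary's four rows, then Barry's four rows
hours <- c(8L, 8L, 0L, 6L, 0L, 12L, 12L, 12L)
name  <- rep(c("Mary", "Barry"), each = 4)

# The counter goes up by one at every zero-hour row, so all rows
# since the same day off share one value
grp <- cumsum(hours == 0)
grp
# 0 0 1 1 2 2 2 2

# Running total of hours within each (name, grp) stretch
total <- ave(hours, name, grp, FUN = cumsum)
total
# 8 16 0 6 0 12 24 36
```

Note that the counter runs over the whole table rather than restarting per person, so it only identifies a stretch in combination with the person. That is why Name is part of the grouping as well.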