Extract a duration from character in R

I'm currently having a problem with a dataset that I need to analyze. Here is a sample of the data:

      session_id    individ_id  colony     species           year_tracked
1 12141_2009-07-01 GBT_FP96194 Eynhallow Northern fulmar      2009_10
2 12141_2010-07-18 GBT_FP96235 Eynhallow Northern fulmar      2010_11
3 12143_2009-07-01 GBT_FC14766 Eynhallow Northern fulmar      2009_10
4 12143_2010-07-18 GBT_FR77883 Eynhallow Northern fulmar      2010_12
5 12144_2009-07-01 GBT_FP05030 Eynhallow Northern fulmar      2009_10
6 12145_2009-07-01 GBT_FA82356 Eynhallow Northern fulmar      2009_10

I need to create a new column containing the number of years tracked, which in this case would be:

2010-2009 --> 1
2011-2010 --> 1
2010-2009 --> 1
2012-2010 --> 2
2010-2009 --> 1
2010-2009 --> 1

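The year_tracked column is of class character. Perhaps a function that takes the first 4 characters and the last 2 characters of each cell and converts them to dates would work, but I don't know how to do that.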

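Here is an approach using regular expressions: first extract the four digits of the first year with str_extract(., "[0-9]{4}"), then extract the two digits of the second year with str_extract(., "(?<=_)[0-9]{2}"), convert them to YYYY format by prepending 20, and finally subtract the first year from the second.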

library(magrittr)
library(stringr)

from <- df$year_tracked %>%
  str_extract(.,"[0-9]{4}") %>%
  as.numeric()

to <- df$year_tracked %>%
  str_extract(.,"(?<=_)[0-9]{2}") %>%
  paste0("20",.) %>%
  as.numeric()

result <- to - from

[1] 1 1 1 2 1 1
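
For comparison, the idea from the question (first four characters, last two characters) can also be sketched in base R, storing the result directly as a new column. This is only a sketch under assumptions: the column name n_years is made up for illustration, every value is assumed to have the form YYYY_YY, and both years are assumed to fall in the same century.

```r
# Minimal sample with the same year_tracked format as in the question
df <- data.frame(
  year_tracked = c("2009_10", "2010_11", "2009_10", "2010_12", "2009_10", "2009_10"),
  stringsAsFactors = FALSE
)

# First four characters: the start year
from <- as.numeric(substr(df$year_tracked, 1, 4))

# Century digits of the start year plus the last two characters: the end year
# (assumption: start and end year fall in the same century)
to <- as.numeric(paste0(substr(df$year_tracked, 1, 2),
                        substr(df$year_tracked, 6, 7)))

# n_years is a hypothetical column name
df$n_years <- to - from
df$n_years
# [1] 1 1 1 2 1 1
```

Taking the century from the start year avoids hard-coding "20", but a range that crosses a century boundary (for example 1999_00) would still need separate handling.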

Data:

# 'Northern fulmar' is quoted so that read.table keeps it as a single field;
# unquoted, each row has one more field than the header and the columns shift
df <- read.table(text = "session_id individ_id colony species year_tracked
 12141_2009-07-01 GBT_FP96194 Eynhallow 'Northern fulmar' 2009_10
 12141_2010-07-18 GBT_FP96235 Eynhallow 'Northern fulmar' 2010_11
 12143_2009-07-01 GBT_FC14766 Eynhallow 'Northern fulmar' 2009_10
 12143_2010-07-18 GBT_FR77883 Eynhallow 'Northern fulmar' 2010_12
 12144_2009-07-01 GBT_FP05030 Eynhallow 'Northern fulmar' 2009_10
 12145_2009-07-01 GBT_FA82356 Eynhallow 'Northern fulmar' 2009_10",
 header = TRUE, stringsAsFactors = FALSE)

An option with separate

library(dplyr)
library(tidyr)
library(stringr)
df1 %>% 
    mutate(year_tracked2 = str_replace(year_tracked, "_", "_20")) %>% 
    separate(year_tracked2, into = c('year1', 'year2'), convert = TRUE) %>%
    mutate(n = year2 - year1) %>%
    select(-year1, -year2)
#       session_id  individ_id    colony         species year_tracked n
#1 12141_2009-07-01 GBT_FP96194 Eynhallow Northern fulmar      2009_10 1
#2 12141_2010-07-18 GBT_FP96235 Eynhallow Northern fulmar      2010_11 1
#3 12143_2009-07-01 GBT_FC14766 Eynhallow Northern fulmar      2009_10 1
#4 12143_2010-07-18 GBT_FR77883 Eynhallow Northern fulmar      2010_12 2
#5 12144_2009-07-01 GBT_FP05030 Eynhallow Northern fulmar      2009_10 1
#6 12145_2009-07-01 GBT_FA82356 Eynhallow Northern fulmar      2009_10 1

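Or a simpler option is to replace the _ with :20 and just eval the result. Each string then becomes a sequence expression such as 2009:2010, so the length of the sequence minus 1 is the number of years.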

library(purrr)
df1 %>% 
   mutate(n = lengths(map(str_replace(year_tracked, "_", ":20"),
           ~ eval(parse(text = .x))))- 1)
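
To see why subtracting 1 from the length gives the number of years, it may help to walk through a single value by hand. The sketch below uses only base R, and the variable names are illustrative. Keep in mind that eval(parse()) runs whatever text it is given, so this trick is only reasonable when the column contents are trusted.

```r
x <- "2010_12"

# Replacing "_" with ":20" turns the string into the R expression 2010:2012
expr <- sub("_", ":20", x, fixed = TRUE)
expr
# [1] "2010:2012"

# Evaluating it yields the sequence of years from start to end
yrs <- eval(parse(text = expr))
yrs
# [1] 2010 2011 2012

# The sequence has one more element than the number of years elapsed
n <- length(yrs) - 1
n
# [1] 2
```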

data

df1 <- structure(list(session_id = c("12141_2009-07-01", "12141_2010-07-18", 
"12143_2009-07-01", "12143_2010-07-18", "12144_2009-07-01", "12145_2009-07-01"
), individ_id = c("GBT_FP96194", "GBT_FP96235", "GBT_FC14766", 
"GBT_FR77883", "GBT_FP05030", "GBT_FA82356"), colony = c("Eynhallow", 
"Eynhallow", "Eynhallow", "Eynhallow", "Eynhallow", "Eynhallow"
), species = c("Northern fulmar", "Northern fulmar", "Northern fulmar", 
"Northern fulmar", "Northern fulmar", "Northern fulmar"), year_tracked = c("2009_10", 
"2010_11", "2009_10", "2010_12", "2009_10", "2009_10")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))