Extract a duration from a character column in R
I'm currently stuck on a dataset I need to analyse. Here is a sample of the data:
session_id individ_id colony species year_tracked
1 12141_2009-07-01 GBT_FP96194 Eynhallow Northern fulmar 2009_10
2 12141_2010-07-18 GBT_FP96235 Eynhallow Northern fulmar 2010_11
3 12143_2009-07-01 GBT_FC14766 Eynhallow Northern fulmar 2009_10
4 12143_2010-07-18 GBT_FR77883 Eynhallow Northern fulmar 2010_12
5 12144_2009-07-01 GBT_FP05030 Eynhallow Northern fulmar 2009_10
6 12145_2009-07-01 GBT_FA82356 Eynhallow Northern fulmar 2009_10
I need to create a new column containing the number of years tracked, which in this case would be:
2010-2009 --> 1
2011-2010 --> 1
2010-2009 --> 1
2012-2010 --> 2
2010-2009 --> 1
2010-2009 --> 1
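The year_tracked column is of class character. Maybe a function that takes the first four characters and the last two characters of each cell and converts them to dates would work, but I don't know how to do that.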
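Here is an approach with regular expressions: first extract the four digits of the first year with str_extract(., "[0-9]{4}"), then extract the two digits of the second year with str_extract(., "(?<=_)[0-9]{2}"), turn them into YYYY format by prepending 20, and subtract the first year from the second.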
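library(magrittr)
library(stringr)
# First year: the leading four digits
from <- df$year_tracked %>%
  str_extract(., "[0-9]{4}") %>%
  as.numeric()
# Second year: the two digits after the underscore, prefixed with "20"
to <- df$year_tracked %>%
  str_extract(., "(?<=_)[0-9]{2}") %>%
  paste0("20", .) %>%
  as.numeric()
result <- to - from
result
# [1] 1 1 1 2 1 1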
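The question's own idea (first four characters, last two characters) can also be sketched in base R with substr. The version below is only a sketch under two assumptions that are not in the original answer: that no tracking period is 100 years or longer, and that the century should be taken from the start year rather than hard-coded as 20. The value "1999_00" is hypothetical and is there only to show the century rollover.

```r
# Sample values from the question, plus one hypothetical value ("1999_00")
year_tracked <- c("2009_10", "2010_11", "2010_12", "1999_00")

# Start year: first four characters
from <- as.integer(substr(year_tracked, 1, 4))

# Two-digit end year: last two characters
n_chr <- nchar(year_tracked)
yy <- as.integer(substr(year_tracked, n_chr - 1, n_chr))

# Borrow the century from the start year instead of pasting "20"
to <- (from %/% 100) * 100 + yy

# If the end year lands before the start year, the period crossed a century
to <- ifelse(to < from, to + 100, to)

n_years <- to - from
print(n_years)
```

For data that stays within the 2000s, as in the question, this gives the same differences as prepending 20.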
Data:
# The species name contains a space, so it is quoted to keep it in one field
df <- read.table(text = "session_id individ_id colony species year_tracked
12141_2009-07-01 GBT_FP96194 Eynhallow 'Northern fulmar' 2009_10
12141_2010-07-18 GBT_FP96235 Eynhallow 'Northern fulmar' 2010_11
12143_2009-07-01 GBT_FC14766 Eynhallow 'Northern fulmar' 2009_10
12143_2010-07-18 GBT_FR77883 Eynhallow 'Northern fulmar' 2010_12
12144_2009-07-01 GBT_FP05030 Eynhallow 'Northern fulmar' 2009_10
12145_2009-07-01 GBT_FA82356 Eynhallow 'Northern fulmar' 2009_10",
                 header = TRUE, stringsAsFactors = FALSE)
An option with separate:
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
  mutate(year_tracked2 = str_replace(year_tracked, "_", "_20")) %>%
  separate(year_tracked2, into = c('year1', 'year2'), convert = TRUE) %>%
  mutate(n = year2 - year1) %>%
  select(-year1, -year2)
# session_id individ_id colony species year_tracked n
#1 12141_2009-07-01 GBT_FP96194 Eynhallow Northern fulmar 2009_10 1
#2 12141_2010-07-18 GBT_FP96235 Eynhallow Northern fulmar 2010_11 1
#3 12143_2009-07-01 GBT_FC14766 Eynhallow Northern fulmar 2009_10 1
#4 12143_2010-07-18 GBT_FR77883 Eynhallow Northern fulmar 2010_12 2
#5 12144_2009-07-01 GBT_FP05030 Eynhallow Northern fulmar 2009_10 1
#6 12145_2009-07-01 GBT_FA82356 Eynhallow Northern fulmar 2009_10 1
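To see what the intermediate step produces, here is a small sketch on a toy data frame (the name toy and its two rows are only for illustration, with values taken from the question). It spells out sep = "_"; as far as I know, separate() by default splits on any run of non-alphanumeric characters, which is why the call works without it.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy data frame holding only the column of interest
toy <- data.frame(year_tracked = c("2009_10", "2010_12"),
                  stringsAsFactors = FALSE)

split_years <- toy %>%
  # "2009_10" -> "2009_2010"
  mutate(year_tracked2 = str_replace(year_tracked, "_", "_20")) %>%
  # Split into two columns; convert = TRUE turns the pieces into integers
  separate(year_tracked2, into = c("year1", "year2"),
           sep = "_", convert = TRUE)

str(split_years)
```

Because convert = TRUE already returns integer columns, year2 - year1 can be computed directly with no extra as.numeric() call.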
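Or an easier option is to replace the _ with :20 and just eval the result, so that "2009_10" becomes the sequence 2009:2010, whose length minus one is the number of years: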
library(purrr)
df1 %>%
  mutate(n = lengths(map(str_replace(year_tracked, "_", ":20"),
                         ~ eval(parse(text = .x)))) - 1)
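A short base-R sketch of what the eval step does for individual values, followed by a variant that does the same arithmetic without evaluating strings as code. The variant is my own alternative, not part of the answer above. It may be preferable when the column could contain unexpected text, since eval(parse()) will run whatever expression the string holds.

```r
year_tracked <- c("2009_10", "2010_12")

# "2009_10" becomes the R expression "2009:2010"
expr <- sub("_", ":20", year_tracked, fixed = TRUE)

# Evaluating the first expression yields the sequence 2009, 2010
first_seq <- eval(parse(text = expr[1]))
print(first_seq)

# Number of years tracked = length of the sequence minus one
print(length(first_seq) - 1)

# Same result without eval: split on "_" and subtract
parts <- strsplit(year_tracked, "_", fixed = TRUE)
n <- vapply(parts,
            function(p) as.integer(paste0("20", p[2])) - as.integer(p[1]),
            integer(1))
print(n)
```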
Data
df1 <- structure(list(session_id = c("12141_2009-07-01", "12141_2010-07-18",
"12143_2009-07-01", "12143_2010-07-18", "12144_2009-07-01", "12145_2009-07-01"
), individ_id = c("GBT_FP96194", "GBT_FP96235", "GBT_FC14766",
"GBT_FR77883", "GBT_FP05030", "GBT_FA82356"), colony = c("Eynhallow",
"Eynhallow", "Eynhallow", "Eynhallow", "Eynhallow", "Eynhallow"
), species = c("Northern fulmar", "Northern fulmar", "Northern fulmar",
"Northern fulmar", "Northern fulmar", "Northern fulmar"), year_tracked = c("2009_10",
"2010_11", "2009_10", "2010_12", "2009_10", "2009_10")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))