Allocate values to a new column based on logical string matching in data.table

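I have a large student data set in which honours students are labelled with non-standard naming conventions. I need to create/populate a new column that returns Y or N based on a string match for the word "Honours".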

Currently my data looks like this, with over 200,000 students:

library(data.table)
students<-data.table(Student_ID = c(10001:10005), 
                    Degree= c("Bachelor of Laws", "Honours Degree in Commerce", "Bachelor of Laws (with Honours)", "Bachelor of Nursing with Honours", "Bachelor of Nursing"))

I need to add a third column, so that after I create the new column 'Honours' the data.table way, it is populated like this:

students<-data.table(Student_ID = c(10001:10005), 
                      Degree= c("Bachelor of Laws", "Honours Degree in Commerce","Bachelor of Laws (with Honours)", "Bachelor of Nursing with Honours", "Bachelor of Nursing"), 
                      Honours = c("N","Y", "Y", "Y","N"))

Any help would be greatly appreciated.

Also, by the data.table way, I mean something like:

students[,Honours:="N"]

It's actually pretty simple:

students[, Honours := c("N", "Y")[grepl("Honours", Degree, fixed = TRUE) + 1L]]

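All you need to do is search for "Honours" with a pattern-matching function such as grepl (this isn't really a regular expression, so you can use fixed = TRUE), and then subset the vector c("N", "Y") according to the result. Adding 1L to the TRUE/FALSE logical vector converts it to a vector of 1s and 2s (FALSE becomes 1, TRUE becomes 2), which is then used to pick values from c("N", "Y").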
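To make the indexing trick easier to follow, here is a small step-by-step sketch in base R. It assumes only the Degree values from the question; the intermediate names hit, idx and flag are introduced here purely for illustration.

```r
# Base R only; the vector below copies the Degree values from the question.
degree <- c("Bachelor of Laws", "Honours Degree in Commerce",
            "Bachelor of Laws (with Honours)",
            "Bachelor of Nursing with Honours", "Bachelor of Nursing")

# Step 1: a literal substring match gives a logical vector
hit <- grepl("Honours", degree, fixed = TRUE)

# Step 2: adding 1L coerces FALSE/TRUE to the integer positions 1/2
idx <- hit + 1L

# Step 3: use those positions to pick from c("N", "Y")
flag <- c("N", "Y")[idx]
print(flag)
```

The data.table one-liner above does the same three steps in a single expression inside `:=`.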

Alternatively, if that is too hard to read, you can use ifelse instead:

students[, Honours := ifelse(grepl("Honours", Degree, fixed = TRUE), "Y", "N")]

Of course, if "Honours" can appear in different case variants, you can switch the grepl call to grepl("Honours", Degree, ignore.case = TRUE)
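Note that the case-insensitive call drops fixed = TRUE: grepl warns and ignores ignore.case when both are set. A minimal sketch of two ways to match regardless of case follows; the sample values are hypothetical and not taken from the question's data.

```r
# Hypothetical values with mixed capitalisation (not from the question's data)
degree <- c("Bachelor of Laws (with honours)",
            "HONOURS Degree in Commerce",
            "Bachelor of Nursing")

# Option 1: regex matching that ignores case
by_regex <- grepl("Honours", degree, ignore.case = TRUE)

# Option 2: keep a literal match by lower-casing the text first
by_lower <- grepl("honours", tolower(degree), fixed = TRUE)

print(by_regex)
print(by_lower)
```

Both options flag the first two entries and not the third; which one reads better is a matter of taste.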


P.S.

I would suggest sticking with a logical vector, because you can easily operate on it afterwards

For example

students[, Honours := grepl("Honours", Degree, fixed = TRUE)]

Now if you want to select only the people with "Honours", you can do

students[(Honours)]
#    Student_ID                           Degree Honours
# 1:      10002       Honours Degree in Commerce    TRUE
# 2:      10003  Bachelor of Laws (with Honours)    TRUE
# 3:      10004 Bachelor of Nursing with Honours    TRUE

Or the people without "Honours"

students[!(Honours)]
#    Student_ID              Degree Honours
# 1:      10001    Bachelor of Laws   FALSE
# 2:      10005 Bachelor of Nursing   FALSE
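As a further illustration of why a logical column is convenient, the sketch below counts and summarises honours students directly. It rebuilds the question's sample data; the names n_honours, share and counts are introduced here for illustration only.

```r
library(data.table)

# Sample data copied from the question
students <- data.table(
  Student_ID = 10001:10005,
  Degree = c("Bachelor of Laws", "Honours Degree in Commerce",
             "Bachelor of Laws (with Honours)",
             "Bachelor of Nursing with Honours", "Bachelor of Nursing")
)

# Logical flag, as suggested above
students[, Honours := grepl("Honours", Degree, fixed = TRUE)]

# TRUE counts as 1 and FALSE as 0, so sums and means work directly
n_honours <- students[, sum(Honours)]   # number of honours students
share     <- students[, mean(Honours)]  # proportion of honours students

# Row counts per group
counts <- students[, .N, by = Honours]
print(counts)
```

If a Y/N column is still required for reporting, it can be derived from the logical column at the end with either approach shown earlier.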