Assign values to a new column based on logical string matching in data.table
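I have a large student dataset with non-standard naming conventions for honours students. I need to create/populate a new column that returns Y or N based on a string match on the word "Honours".
Currently my data looks like this, but with more than 200,000 students: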
library(data.table)
students<-data.table(Student_ID = c(10001:10005),
Degree= c("Bachelor of Laws", "Honours Degree in Commerce", "Bachelor of Laws (with Honours)", "Bachelor of Nursing with Honours", "Bachelor of Nursing"))
I need to add a third column, 'Honours', created the data.table way, so that the result is populated like this:
students<-data.table(Student_ID = c(10001:10005),
Degree= c("Bachelor of Laws", "Honours Degree in Commerce","Bachelor of Laws (with Honours)", "Bachelor of Nursing with Honours", "Bachelor of Nursing"),
Honours = c("N","Y", "Y", "Y","N"))
Any help would be appreciated.
Also, by the data.table way, I mean something like:
students[,Honours:="N"]
It's actually quite simple:
students[, Honours := c("N", "Y")[grepl("Honours", Degree, fixed = TRUE) + 1L]]
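All you need to do is search for "Honours" with a pattern-matching function such as grepl (the pattern is not really a regular expression, so you can use fixed = TRUE to match it literally), and then subset the vector c("N", "Y") according to the result. Adding 1L to the TRUE/FALSE logical vector converts it to an integer vector of 1s and 2s, which is then used to index into c("N", "Y"): FALSE picks "N" and TRUE picks "Y".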
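To make the indexing trick less opaque, here is a minimal base R sketch that splits it into its three steps. The two-element vector degree is a hypothetical stand-in for the Degree column, not the real data.

```r
# Hypothetical toy vector standing in for students$Degree
degree <- c("Bachelor of Laws", "Honours Degree in Commerce")

# Step 1: literal (non-regex) search gives a logical vector
is_honours <- grepl("Honours", degree, fixed = TRUE)

# Step 2: logical + 1L is coerced to integer, so FALSE -> 1L and TRUE -> 2L
idx <- is_honours + 1L

# Step 3: use those positions to pick from c("N", "Y")
flag <- c("N", "Y")[idx]
flag
```

Inside students[, Honours := ...] the same three steps simply happen in one expression.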
Or, if that is too hard to read, you can use ifelse instead:
students[, Honours := ifelse(grepl("Honours", Degree, fixed = TRUE), "Y", "N")]
Of course, if "Honours" can appear in different case variants, you can switch the grepl call to grepl("Honours", Degree, ignore.case = TRUE)
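One detail worth knowing here: as far as I know, grepl ignores ignore.case (with a warning) when fixed = TRUE is also set, which is why fixed is dropped in the call above. The sketch below compares that call with an alternative that keeps fixed = TRUE by lower-casing the input first. The degree strings are made up for illustration and are not from the question's data.

```r
# Hypothetical degree strings with mixed capitalisation
degree <- c("Bachelor of Laws (with HONOURS)",
            "honours degree in Commerce",
            "Bachelor of Nursing")

# Option 1: regex matching with case folding
by_regex <- grepl("honours", degree, ignore.case = TRUE)

# Option 2: keep the literal match by lower-casing the input first
by_fixed <- grepl("honours", tolower(degree), fixed = TRUE)

# Both approaches should flag the same rows
identical(by_regex, by_fixed)
```

Either form can be dropped into students[, Honours := ...] in place of the case-sensitive call.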
P.S.
I'd suggest sticking with a logical vector instead, because it is easy to operate on afterwards.
For example:
students[, Honours := grepl("Honours", Degree, fixed = TRUE)]
Now, if you want to select only the people with "Honours", you can do
students[(Honours)]
# Student_ID Degree Honours
# 1: 10002 Honours Degree in Commerce TRUE
# 2: 10003 Bachelor of Laws (with Honours) TRUE
# 3: 10004 Bachelor of Nursing with Honours TRUE
Or the people without "Honours":
students[!(Honours)]
# Student_ID Degree Honours
# 1: 10001 Bachelor of Laws FALSE
# 2: 10005 Bachelor of Nursing FALSE