In R, how do you extract multiple matched terms as a string, and flag rows that match as TRUE, with regex or grep?
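I'm still a beginner with R. I need help with some code that searches a vector for the terms in a list and returns TRUE when any of them match. If TRUE, it should also return a string of the matched terms.
I have set it up to tell me whether a term matched and to return the first matched term, but I'm not sure how to get the rest of the matched terms.
In the code below, I have my Desired_Output and my imperfect Final_Output.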
#create dataset of 2 columns/vectors. 1st column is "Job Title", 2nd column is "Work Experience"
'Work Experience' <- c("cooked food; cleaned house; made beds", "analyzed data; identified gaps; used sql, python, and r", "used tableau to make dashboards for clients; applied advanced macro excel functions", "financial planning and strategy; consulted with leaders and clients")
'Job Title' <- c("dad", "research analyst", "business intelligence consultant", "finance consultant")
Job_Hist <- data.frame(`Job Title`, `Work Experience`)
#create list of terms to search for in Job_Hist
Term_List <- c("python", " r", "sql", "tableau", "excel")
#use grepl to search the Work Experience vector for terms in Term_List THEN return TRUE or FALSE
Term_TF <- grepl(paste(Term_List, collapse = '|'), Job_Hist$Work.Experience)
#add a new column to our final output dataframe that shows if the job experience matched our terms
Final_Output<-Job_Hist
Final_Output$Term_Test <- Term_TF
#Let's see what terms caused the TRUE flag in the Final_Output
m <- regexpr(paste(Term_List, collapse = '|'),
             Job_Hist$Work.Experience, perl = TRUE)
T_Match <- regmatches(Job_Hist$Work.Experience, m)
#Compare Final_Output to my Desired_Output and please help me :)
Desired_T_Match <- c("NA", "sql, python, r", "tableau, excel", "NA")
Desired_Output <- data.frame(`Job Title`, `Work Experience`, Term_TF, Desired_T_Match)
#I need 2 things.
#1) a way to tie T_Match back to Final_Output... something like if TRUE, then match
#2) a way to return every term matched in a comma-delimited string. Example: research analyst analyzed data... TRUE sql, python
You can use stringr::str_extract_all to get all of the matches from each row:
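library(stringr)
library(tidyverse)  # attaches stringr as well, plus the %>% pipe used further down
# simplify = TRUE returns a character matrix with one column per match position
Job_Hist$matches <- str_extract_all(Job_Hist$Work.Experience,
                                    paste(Term_List, collapse = '|'), simplify = TRUE)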
  Work.Experience                                                                      Term   matches.1  matches.2  matches.3
1 cooked food; cleaned house; made beds                                                FALSE
2 analyzed data; identified gaps; used sql, python, and r                              TRUE   sql        python     r
3 used tableau to make dashboards for clients; applied advanced macro excel functions  TRUE   tableau    excel
4 financial planning and strategy; consulted with leaders and clients                  FALSE
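For readers who would rather stay in base R, the question's regexpr() call can be swapped for gregexpr(), which reports every match per string instead of only the first. The following is a minimal sketch, assuming the same sample data as the question; the names work_experience, pattern, and all_matches are introduced here for illustration only.

```r
# Sample data copied from the question
work_experience <- c(
  "cooked food; cleaned house; made beds",
  "analyzed data; identified gaps; used sql, python, and r",
  "used tableau to make dashboards for clients; applied advanced macro excel functions",
  "financial planning and strategy; consulted with leaders and clients"
)
Term_List <- c("python", " r", "sql", "tableau", "excel")
pattern <- paste(Term_List, collapse = "|")

# regexpr() finds only the first match in each string;
# gregexpr() finds all of them, so regmatches() returns a list
# holding one character vector per input string
all_matches <- regmatches(work_experience, gregexpr(pattern, work_experience))

# Number of matched terms per row
lengths(all_matches)
```

Rows without a match come back as character(0), so lengths(all_matches) > 0 gives the same TRUE/FALSE flag as the grepl() call in the question.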
Edit: If you want the matches in a single column as a comma-separated string, you can use:
str_extract_all(Job_Hist$Work.Experience, paste(Term_List, collapse = '|')) %>%
  sapply(., paste, collapse = ", ")
matches
1
2 sql, python, r
3 tableau, excel
4
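To tie the matched terms back to Final_Output, as the question asks, something along these lines may work. It is a sketch rather than a definitive solution: it assumes stringr is installed, rebuilds the question's data frame so the block runs on its own, and uses a hypothetical helper variable match_list. It also writes a real NA for rows without matches, where the question's Desired_Output uses the string "NA".

```r
library(stringr)

# Data frame rebuilt from the question (column names as data.frame() would produce them)
Job_Hist <- data.frame(
  Job.Title = c("dad", "research analyst",
                "business intelligence consultant", "finance consultant"),
  Work.Experience = c(
    "cooked food; cleaned house; made beds",
    "analyzed data; identified gaps; used sql, python, and r",
    "used tableau to make dashboards for clients; applied advanced macro excel functions",
    "financial planning and strategy; consulted with leaders and clients"
  ),
  stringsAsFactors = FALSE
)
Term_List <- c("python", " r", "sql", "tableau", "excel")
pattern <- paste(Term_List, collapse = "|")

# One character vector of matches per row (simplify = FALSE is the default)
match_list <- str_extract_all(Job_Hist$Work.Experience, pattern)

Final_Output <- Job_Hist
# TRUE when at least one term matched in that row
Final_Output$Term_Test <- lengths(match_list) > 0
# Trim the leading space carried by the " r" term, then collapse to one string;
# rows with no matches get NA_character_
Final_Output$T_Match <- vapply(
  match_list,
  function(m) if (length(m) == 0) NA_character_ else paste(trimws(m), collapse = ", "),
  character(1)
)

Final_Output[, c("Job.Title", "Term_Test", "T_Match")]
```

Because both new columns are computed from the same match_list, the flag and the string cannot disagree with each other.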
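Note that if you use the default argument simplify = FALSE in str_extract_all, your matches column will look correct when printed, just like the result we got from sapply above. However, if you inspect it with str(), you will see that the column is actually a list in which each element is its own character vector, which causes problems for some types of analysis.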
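The difference is easy to check on a toy input. The sketch below assumes stringr is installed; the vector x, the shortened pattern, and the names as_list and as_chr are made up for illustration.

```r
library(stringr)

x <- c("used sql and python", "cooked food")
pattern <- "sql|python"

# Default simplify = FALSE: a list with one character vector per input string
as_list <- str_extract_all(x, pattern)

# Collapsing each element to a single string yields a plain character vector;
# vapply() enforces that every result is exactly one string
as_chr <- vapply(as_list, paste, character(1), collapse = ", ")

is.list(as_list)      # TRUE
is.character(as_chr)  # TRUE

str(as_list)
str(as_chr)
```

vapply() is used here instead of sapply() because sapply() returns an empty list when given zero-length input, while vapply() always returns the declared type.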