Lookup table with subset/grepl in R
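I'm analyzing a set of URLs and values extracted with a scraping tool. While I could extract substrings from the URLs, I'd really rather not use regex to do so. Is there a simple way to do a lookup-table-style replacement with subset/grepl, without resorting to dplyr (conditionally mutating the variable)?
My current process: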
test <- data.frame(
  url = c('google.com/testing/duck', 'google.com/evaluating/dog', 'google.com/analyzing/cat'),
  content = c(1, 2, 3),
  subdir = NA
)
test[grepl('testing', test$url), ]$subdir <- 'testing'
test[grepl('evaluating', test$url), ]$subdir <- 'evaluating'
test[grepl('analyzing', test$url), ]$subdir <- 'analyzing'
Clearly, this is a bit clumsy and doesn't scale well. With dplyr, I could do something with conditionals, like:
# Requires magrittr (for %<>%) and dplyr
test %<>% tbl_df() %>%
  mutate(subdir = ifelse(
    grepl('testing', url),        # match against url; subdir is still NA here
    'test r',
    ifelse(
      grepl('evaluating', url),
      'eval r',
      ifelse(
        grepl('analyzing', url),
        'anal r',
        NA
      ))))
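But, once again, that's really silly, and I'd rather avoid a package dependency if at all possible. Is there any way to do regex-based subsetting through some kind of lookup table?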
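Edit: Some clarifications:
- For extracting the subdirectory, yes, a regex is the most efficient approach; however, I'm hoping for a more general pattern that can match a dictionary-like structure of strings to other arbitrary values.
- Sure, the nested ifelse is ugly and error-prone; I just wanted to get a simple dplyr example up.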
Edit 2: I wanted to come back and post what I ended up with, based on BondedDust's approach. I decided to practice some mapping and non-standard evaluation:
test <- data.frame(
  url = c(
    'google.com/testing/duck',
    'google.com/testing/dog',
    'google.com/testing/cat',
    'google.com/evaluating/duck',
    'google.com/evaluating/dog',
    'google.com/evaluating/cat',
    'google.com/analyzing/duck',
    'google.com/analyzing/dog',
    'google.com/analyzing/cat',
    'banana'
  ),
  content = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  subdir = NA
)
# List used for key/value lookup, names can be regex
lookup <- c(
"testing" = "Testing is important",
"Eval.*" = 'eval in R',
"analy(z|s)ing" = 'R is fun'
)
# Dumb test for error handling:
# lookup <- c('test', 'hey')
# Defining new lookup function
regexLookup <- function(data, dict, searchColumn, targetColumn, ignore.case = TRUE) {
  # Basic check; errors and handling still need to be separated
  if (is.null(names(dict)) || is.null(dict[[1]])) {
    stop("Not a valid replacement value; use a key/value store for `dict`.")
  }
  # Non-standard eval for the column names; not sure if I should
  # add safety/type checks for these
  searchColumn <- eval(substitute(searchColumn), data)
  targetColumn <- deparse(substitute(targetColumn))
  # Define find-and-replace utility
  findAndReplace <- function(key, val) {
    data[grepl(key, searchColumn, ignore.case = ignore.case), targetColumn] <- val
    # Write the modified copy back to regexLookup()'s own `data`
    data <<- data
  }
  # Map over the key/value store
  mapply(findAndReplace, names(dict), dict)
  # Return result, with non-matching rows preserved
  return(data)
}
regexLookup(test, lookup, url, subdir, ignore.case = FALSE)
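For comparison, here is a minimal sketch of the same idea without `<<-` or non-standard evaluation. The name `regex_lookup` and its string-valued `search_col`/`target_col` arguments are hypothetical, not from the post above. It assumes later dictionary entries should overwrite earlier ones when several patterns match the same row.

```r
# Hypothetical variant: column names are passed as strings, and a plain
# for loop updates the local copy of `data`, so no `<<-` is needed.
regex_lookup <- function(data, dict, search_col, target_col, ignore.case = TRUE) {
  stopifnot(!is.null(names(dict)), search_col %in% names(data))
  for (i in seq_along(dict)) {
    # names(dict)[i] is the regex key, dict[[i]] the replacement value
    hit <- grepl(names(dict)[i], data[[search_col]], ignore.case = ignore.case)
    data[hit, target_col] <- dict[[i]]
  }
  data
}

test <- data.frame(
  url = c('google.com/testing/duck', 'google.com/evaluating/dog',
          'google.com/analyzing/cat', 'banana'),
  subdir = NA,
  stringsAsFactors = FALSE
)
lookup <- c(
  "testing" = "Testing is important",
  "Eval.*" = "eval in R",
  "analy(z|s)ing" = "R is fun"
)
result <- regex_lookup(test, lookup, "url", "subdir")
```

Note that with `ignore.case = FALSE`, the key `"Eval.*"` would not match the lowercase `evaluating` URLs, so those rows would stay `NA`.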
for (target in c('testing', 'evaluating', 'analyzing')) {
  test[grepl(target, test$url), 'subdir'] <- target
}
test
                        url content     subdir
1   google.com/testing/duck       1    testing
2 google.com/evaluating/dog       2 evaluating
3  google.com/analyzing/cat       3  analyzing
The targets vector could instead be the name of a vector in the workspace:
targets <- c('testing','evaluating','analyzing')
for( target in targets ) { ...}
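In the loop form, a later target overwrites an earlier one when a URL matches both. If the first match should win instead, one possible vectorized sketch is to build a logical match matrix and pick the first matching column per row. The names `urls`, `hits`, and `idx` are illustrative only, and the sketch assumes more than one URL so that `sapply` returns a matrix.

```r
urls <- c('google.com/testing/duck', 'google.com/evaluating/dog',
          'google.com/analyzing/cat', 'banana')
targets <- c('testing', 'evaluating', 'analyzing')

# One column per target, one row per URL; TRUE where the pattern matches
hits <- sapply(targets, grepl, x = urls)

# Index of the first matching target in each row (NA when nothing matches)
idx <- apply(hits, 1, function(r) which(r)[1])

# Map the indices back to the target strings; unmatched rows stay NA
subdir <- targets[idx]
```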
Try this:
# '/' needs no escaping in an R regex, and the backreference must be written '\\1'
test$subdir <- gsub('.*/(.*)/.*', '\\1', test$url)
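One caveat with a pure substitution is that a string with no match (such as 'banana' in the question's larger example) is returned unchanged rather than as `NA`. A small sketch of one way to guard against that is below. The stricter pattern and the variable names are assumptions for illustration.

```r
urls <- c('google.com/testing/duck', 'banana')

# Capture the text between the first and second '/'
pattern <- '^[^/]*/([^/]*)/.*$'

# Substitute only where the pattern matches; otherwise record NA
subdir <- ifelse(grepl(pattern, urls), sub(pattern, '\\1', urls), NA_character_)
```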