Finding partial string matches in any column of a data frame in R
I have a data frame:
vessel<-c(letters[1:4])
type<-c("Fishery Vessel","NA","NA","Cargo")
class<-c("NA","FISHING","NA","CARGO")
status<-c("NA", "NA", "Engaged in Fishing", "Underway")
df<-data.frame(vessel,type, class, status)
  vessel           type   class             status
1      a Fishery Vessel      NA                 NA
2      b             NA FISHING                 NA
3      c             NA      NA Engaged in Fishing
4      d          Cargo   CARGO           Underway
I want to subset df to include only the rows related to fishing (i.e. rows 1:3), which to me means doing something like:
df.sub<-subset(grep("FISH", df) | grep("Fish", df))
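But this doesn't work. I have been trying apply (e.g. this question) and partial string matching with grep (like this question), but I can't seem to pull the two together.
Thanks for your help. My data has 10 columns and up to 1 million rows, so I would like to avoid loops if possible, but maybe that is the only way? Thanks!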
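For reference, a minimal sketch of why the attempt above does not produce a row filter. It assumes character columns (stringsAsFactors = FALSE is set explicitly here, which the question's code does not do), and the names col_hits, pos_type, mask_type and mask are chosen only for illustration.

```r
# Same example data as in the question; stringsAsFactors = FALSE is an
# assumption made here so the columns are character on every R version.
df <- data.frame(
  vessel = letters[1:4],
  type   = c("Fishery Vessel", "NA", "NA", "Cargo"),
  class  = c("NA", "FISHING", "NA", "CARGO"),
  status = c("NA", "NA", "Engaged in Fishing", "Underway"),
  stringsAsFactors = FALSE
)

# grep() on the whole data frame coerces each column to a single string,
# so the indices it returns refer to columns, not rows.
col_hits <- grep("Fish", df)

# On a single column, grep() returns positions, while grepl() returns one
# TRUE/FALSE per row, which is the form `|` and row subsetting expect.
pos_type  <- grep("Fish", df$type)
mask_type <- grepl("Fish", df$type)

# Combining per-column masks by hand, here for two columns only:
mask <- grepl("fish", df$type, ignore.case = TRUE) |
  grepl("fish", df$class, ignore.case = TRUE)
df[mask, ]
```

The answers below generalise the last step to every column without writing one grepl() call per column by hand.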
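If you want to use apply(), you can compute an index based on your string fish and then subset on it. The Index is computed with grepl(): for each row, it is the number of values that match fish. You can set ignore.case = TRUE to avoid problems with upper- or lower-case text. When the index is greater than or equal to 1, at least one match occurred in that row, so you can use it to create the subset. Here is the code: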
#Data
vessel<-c(letters[1:4])
type<-c("Fishery Vessel","NA","NA","Cargo")
class<-c("NA","FISHING","NA","CARGO")
status<-c("NA", "NA", "Engaged in Fishing", "Underway")
df <- data.frame(vessel, type, class, status, stringsAsFactors = FALSE)
# Subset
# Create an index with apply: number of values in each row that match 'fish'
df$Index <- apply(df[1:4], 1, function(x) sum(grepl('fish', x, ignore.case = TRUE)))
# Filter: keep rows with at least one match
df.sub <- subset(df, Index >= 1)
Output:
  vessel           type   class             status Index
1      a Fishery Vessel      NA                 NA     1
2      b             NA FISHING                 NA     1
3      c             NA      NA Engaged in Fishing     1
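Because apply() over rows calls the function once per row, a column-wise variant may scale better to the million-row case mentioned in the question. The following is a sketch under that assumption, not a benchmark: it runs grepl() once per column with sapply() and counts the matches per row with rowSums(). The names hits and n_match are chosen here for illustration.

```r
# Same example data as above.
df <- data.frame(
  vessel = letters[1:4],
  type   = c("Fishery Vessel", "NA", "NA", "Cargo"),
  class  = c("NA", "FISHING", "NA", "CARGO"),
  status = c("NA", "NA", "Engaged in Fishing", "Underway"),
  stringsAsFactors = FALSE
)

# One grepl() call per column; sapply() binds the resulting logical
# vectors into a rows-by-columns logical matrix.
hits <- sapply(df, grepl, pattern = "fish", ignore.case = TRUE)

# Number of matching cells in each row (same meaning as Index above).
n_match <- rowSums(hits)

# Keep rows with at least one match.
df.sub <- df[n_match >= 1, ]
df.sub
```

Note that with a single-row data frame sapply() returns a plain vector rather than a matrix, so rowSums() would fail on it; this sketch assumes more than one row.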
Another option you can try:
library(dplyr)
library(stringr)
df %>%
  filter_all(any_vars(str_detect(., regex("fish", ignore_case = TRUE))))
#   vessel           type   class             status
# 1      a Fishery Vessel      NA                 NA
# 2      b             NA FISHING                 NA
# 3      c             NA      NA Engaged in Fishing
In base R, we can use a vectorized option with grepl and Reduce:
subset(df, Reduce(`|`, lapply(df[-1], grepl, pattern = 'fish', ignore.case = TRUE)))
#  vessel           type   class             status
#1      a Fishery Vessel      NA                 NA
#2      b             NA FISHING                 NA
#3      c             NA      NA Engaged in Fishing
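To make the one-liner above easier to follow, here is the same idea unpacked into steps, as a sketch. any_col_matches() is a hypothetical helper name introduced for illustration; it searches every column it is given, so drop columns first (for example with df[-1]) if they should be ignored.

```r
# Hypothetical helper: TRUE for each row where any column matches `pattern`.
any_col_matches <- function(data, pattern, ignore.case = TRUE) {
  # One logical vector per column, each of length nrow(data).
  per_col <- lapply(data, grepl, pattern = pattern, ignore.case = ignore.case)
  # Element-wise OR across the list collapses it to a single logical vector.
  Reduce(`|`, per_col)
}

# Same example data as above.
df <- data.frame(
  vessel = letters[1:4],
  type   = c("Fishery Vessel", "NA", "NA", "Cargo"),
  class  = c("NA", "FISHING", "NA", "CARGO"),
  status = c("NA", "NA", "Engaged in Fishing", "Underway"),
  stringsAsFactors = FALSE
)

# Skip the vessel column, as the subset() call above does.
keep <- any_col_matches(df[-1], "fish")
df[keep, ]
```

Since grepl() is vectorized over each column, no explicit loop over rows is involved; the only iteration is over the columns.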