Finding partial match strings in any column in a dataframe in R

I have a data frame:

vessel<-c(letters[1:4])
type<-c("Fishery Vessel","NA","NA","Cargo")
class<-c("NA","FISHING","NA","CARGO")
status<-c("NA", "NA", "Engaged in Fishing", "Underway")
df<-data.frame(vessel,type, class, status)

  vessel           type   class             status
1      a Fishery Vessel      NA                 NA
2      b             NA FISHING                 NA
3      c             NA      NA Engaged in Fishing
4      d          Cargo   CARGO           Underway

I want to subset df to include only the rows related to fishing (i.e. rows 1:3), so to me this means doing something like:

df.sub<-subset(grep("FISH", df) | grep("Fish", df))

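But this doesn't work. I have been trying with apply (e.g. this question) or partial string matching using grep (like this question), but I can't seem to pull it all together.

Thanks for your help. My data has 10 columns and up to 1 million rows, so I would like to avoid loops if possible, but maybe that is the only way? Thanks!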

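If you want to use apply(), you can compute an index based on your string fish and then subset on it. For each row, the index is the sum of the grepl() results, i.e. the number of values in that row that match fish. You can set ignore.case = TRUE to avoid problems with upper- or lower-case text. When the index is greater than or equal to 1, at least one value in the row matched, so you can use it to create the subset. Here is the code: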

# Data
vessel <- c(letters[1:4])
type <- c("Fishery Vessel", "NA", "NA", "Cargo")
class <- c("NA", "FISHING", "NA", "CARGO")
status <- c("NA", "NA", "Engaged in Fishing", "Underway")
df <- data.frame(vessel, type, class, status, stringsAsFactors = FALSE)
# Subset
# Create an index with apply: number of case-insensitive 'fish' matches per row
df$Index <- apply(df[1:4], 1, function(x) sum(grepl('fish', x, ignore.case = TRUE)))
# Filter: keep rows with at least one match
df.sub <- subset(df, Index >= 1)

Output:

  vessel           type   class             status Index
1      a Fishery Vessel      NA                 NA     1
2      b             NA FISHING                 NA     1
3      c             NA      NA Engaged in Fishing     1
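To see what the index is counting, here is a minimal sketch for a single row. The vector row2 is copied by hand from row 2 of df, and the names row2 and hits are only for illustration:

```r
# Row 2 of df, written out by hand as a character vector
row2 <- c(vessel = "b", type = "NA", class = "FISHING", status = "NA")

# grepl() returns one TRUE/FALSE per element of the row
hits <- grepl("fish", row2, ignore.case = TRUE)
hits
# [1] FALSE FALSE  TRUE FALSE

# sum() counts the TRUEs; this is the value stored in Index for that row
sum(hits)
# [1] 1
```

Note that apply() calls this function once per row, so on a data frame with many rows it is effectively a loop over rows.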

Another option you could try:

library(dplyr)
library(stringr)
df %>% 
  filter_all(any_vars(str_detect(., regex("fish", ignore_case =TRUE))))
#   vessel           type   class             status
# 1      a Fishery Vessel      NA                 NA
# 2      b             NA FISHING                 NA
# 3      c             NA      NA Engaged in Fishing
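In recent dplyr releases the scoped verbs such as filter_all() are marked as superseded. As a sketch only, and assuming dplyr 1.0.4 or later (where if_any() is available), the same filter could be written like this; the data frame is rebuilt here so the snippet runs on its own:

```r
library(dplyr)
library(stringr)

# Same example data as in the question
df <- data.frame(
  vessel = letters[1:4],
  type   = c("Fishery Vessel", "NA", "NA", "Cargo"),
  class  = c("NA", "FISHING", "NA", "CARGO"),
  status = c("NA", "NA", "Engaged in Fishing", "Underway"),
  stringsAsFactors = FALSE
)

# Keep rows where any column contains "fish", ignoring case
df.sub <- df %>%
  filter(if_any(everything(), ~ str_detect(.x, regex("fish", ignore_case = TRUE))))

df.sub$vessel
# [1] "a" "b" "c"
```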

In base R, we can use a vectorized option with grepl and Reduce:
subset(df, Reduce(`|`, lapply(df[-1], grepl, pattern = 'fish', ignore.case = TRUE)))
#  vessel           type   class             status
#1      a Fishery Vessel      NA                 NA
#2      b             NA FISHING                 NA
#3      c             NA      NA Engaged in Fishing
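To make the one-liner above easier to follow, here is the same idea split into steps. The intermediate names per_col and keep are only for illustration:

```r
# Same example data as in the question
df <- data.frame(
  vessel = letters[1:4],
  type   = c("Fishery Vessel", "NA", "NA", "Cargo"),
  class  = c("NA", "FISHING", "NA", "CARGO"),
  status = c("NA", "NA", "Engaged in Fishing", "Underway"),
  stringsAsFactors = FALSE
)

# Step 1: one logical vector per column (df[-1] drops the vessel column)
per_col <- lapply(df[-1], grepl, pattern = "fish", ignore.case = TRUE)
# per_col$type   : TRUE  FALSE FALSE FALSE
# per_col$class  : FALSE TRUE  FALSE FALSE
# per_col$status : FALSE FALSE TRUE  FALSE

# Step 2: combine the columns element-wise with OR
keep <- Reduce(`|`, per_col)
keep
# [1]  TRUE  TRUE  TRUE FALSE

# Step 3: keep the rows where any column matched
df[keep, ]
```

Here grepl() is called once per column rather than once per row, so with 10 columns and up to 1 million rows this should involve far fewer function calls than the apply() approach, although the actual timings would need to be measured on the real data.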