Subset columns based on partial matching of column names in the same data frame

Question

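I would like to know how to subset multiple columns in the same data frame by matching the first 5 letters of the column names against each other: where the prefixes are equal, subset those columns and store the result in a new variable.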

Here is a short explanation of the output I need.

Suppose the data frame is eatable:

fruits_area   fruits_production   vegetables_area   vegetable_production

12            100                 26                324
33            250                 40                580
660           510                 43                581

eatable <- data.frame(c(12,33,660),c(100,250,510),c(26,40,43),c(324,580,581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
          "vegetable_production")

I tried to write a function that matches a string in a loop and stores the subset of columns whose names match on the first 5 letters.

checkExpression <- function(dataset,str){
    dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}

checkExpression(eatable,"your_string")

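The function above checks the string correctly, but I am confused about how to match the column names within the dataset against each other.

Edit: I think a regular expression could be used here.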

Answer 1

You could try:

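# Distinct 5-character prefixes of the column names (R strings are 1-indexed)
v <- unique(substr(names(eatable), 1, 5))
# One data frame per prefix, holding the columns whose names match it
lapply(v, function(x) eatable[grepl(x, names(eatable))])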

Or with map() + select() (select_() is deprecated in current dplyr):

library(tidyverse)
map(v, ~ select(eatable, starts_with(.x)))

Which gives:

#[[1]]
#  fruits_area fruits_production
#1          12               100
#2          33               250
#3         660               510
#
#[[2]]
#  vegetables_area vegetable_production
#1              26                  324
#2              40                  580
#3              43                  581

If you want to wrap this in a function:

checkExpression <- function(df, l = 5) {
  v <- unique(substr(names(df), 1, l))
  lapply(v, function(x) df[grepl(x, names(df))])
}

Then simply use:

checkExpression(eatable, 5)
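
As a possible alternative to the lapply() approach above, base R's split.default() can group the columns by prefix in one step and names each piece after its prefix, which makes it easy to store a group in a new variable. This is only a minimal sketch, assuming the eatable data frame from the question; the helper name splitByPrefix is hypothetical.

```r
# Data frame from the question
eatable <- data.frame(c(12, 33, 660), c(100, 250, 510), c(26, 40, 43), c(324, 580, 581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
                    "vegetable_production")

# Hypothetical helper: split the columns of df into a named list of
# data frames, one per distinct prefix of length l
splitByPrefix <- function(df, l = 5) {
  prefix <- substr(names(df), 1, l)
  # split.default() treats the data frame as a list of columns
  split.default(df, prefix)
}

groups <- splitByPrefix(eatable)
names(groups)            # the distinct prefixes
fruits <- groups$fruit   # store one group in its own variable
```

Because the result is a named list, each group can be picked out by its prefix rather than by position.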

Answer 2

I believe this does what you need:

checkExpression <- function(dataset,str){
  cols <- grepl(paste0("^",str),colnames(dataset),ignore.case = TRUE)
  subset(dataset,select=colnames(dataset)[cols])
}

Note the "^" added to the pattern used in grepl; it anchors the match to the start of the column name.

Using your data:

checkExpression(eatable,"fruit")
##  fruits_area fruits_production
##1          12               100
##2          33               250
##3         660               510
checkExpression(eatable,"veget")
##  vegetables_area vegetable_production
##1              26                  324
##2              40                  580
##3              43                  581
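
To illustrate what the "^" anchor changes, here is a small sketch using made-up column names. The name dried_fruits_area is hypothetical and does not appear in the question's data; it contains "fruit" but does not start with it.

```r
# Hypothetical column names; the second one contains "fruit" mid-name
cols <- c("fruits_area", "dried_fruits_area", "vegetable_production")

# Without the anchor, the pattern matches anywhere in the name
unanchored <- grepl("fruit", cols, ignore.case = TRUE)

# With "^", only names that begin with the pattern match
anchored <- grepl("^fruit", cols, ignore.case = TRUE)

# startsWith() is a base R alternative for literal, case-sensitive prefixes
literal <- startsWith(cols, "fruit")

cols[unanchored]  # includes dried_fruits_area
cols[anchored]    # only fruits_area
```

startsWith() treats the prefix as a literal string, which may be preferable if a prefix could contain regular-expression metacharacters such as "." or "(".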

Answer 3

Your function does exactly what you want, but it had one small error:

checkExpression <- function(dataset,str){
  dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}

Change the name of the object being subset from obje to dataset.

checkExpression(eatable,"fr")
#  fruits_area fruits_production
#1          12               100
#2          33               250
#3         660               510

checkExpression(eatable,"veg")
#  vegetables_area vegetable_production
#1              26                  324
#2              40                  580
#3              43                  581
