How to work with the rows of a data frame without coercing it into a character vector?

Question

I have this data frame:

df <- data.frame(
  a = c(0, 1, 0, 1),
  b = c("a", "b", "c", "d")
)
#   a b
# 1 0 a
# 2 1 b
# 3 0 c
# 4 1 d

Suppose I want to test a condition on each row and return "ok" or "not ok". I expected this to work:

apply(df, 1, function(row){
    if (is.numeric(row[1]) & row[2] != "b") {
        "ok"
    } else {
        "not ok"
    }
})
# It should return: "ok" "not ok" "ok" "ok"

Unfortunately, apply coerces the data frame to a matrix of a single type, so every value is treated as character, and this is the output I get:

# "not ok" "not ok" "not ok" "not ok"

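Is there a way to iterate over the rows of a data frame while preserving the data types? Perhaps using dplyr::do or purrr::map?

Update

I know the condition in the example doesn't make much sense, but I was trying to simplify a more complex one. I'd like to avoid nested ifelse statements because they aren't very readable.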

Answer 1

The ifelse() solution suggested in the comments certainly works for your case:

df$c <- ifelse(is.numeric(df$a) & df$b != "b", "ok", "not ok")
df
##   a b      c
## 1 0 a     ok
## 2 1 b not ok
## 3 0 c     ok
## 4 1 d     ok

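But your more general question is how to apply a function over the rows of a data frame without converting it to a matrix. One possible approach is to use lapply (or one of its relatives, here vapply) over the row indices:

df$c <- vapply(seq_len(nrow(df)), function(i) {
  # df[i, j] extracts a single cell and keeps its column's type
  if (is.numeric(df[i, 1]) && df[i, 2] != "b") {
    "ok"
  } else {
    "not ok"
  }
}, character(1))
df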
##   a b      c
## 1 0 a     ok
## 2 1 b not ok
## 3 0 c     ok
## 4 1 d     ok

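Again, in your case ifelse() is fine. But if you want to do something more complex with the rows of a data frame, applying over the row indices may be the way to go.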
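To make the coercion and a type-preserving alternative concrete, here is a minimal base R sketch. It assumes the df from the question; check_row is a made-up helper name, and mapply() is just one of several ways to walk the columns in parallel, not the only one.

```r
# Same data as in the question (stringsAsFactors = FALSE keeps b as
# character in older R versions too)
df <- data.frame(
  a = c(0, 1, 0, 1),
  b = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# apply() first converts the data frame with as.matrix(); because one
# column is character, the whole matrix becomes character
is.character(as.matrix(df))

# check_row is a hypothetical helper: it receives one value per column,
# each still in its original type
check_row <- function(a, b) {
  if (is.numeric(a) && b != "b") "ok" else "not ok"
}

# mapply() walks the columns in parallel, one row at a time
res <- mapply(check_row, df$a, df$b)
res
```

With the example data this should give "ok" "not ok" "ok" "ok", since a stays numeric inside check_row.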

Answer 2

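The first half of this answer expands on and attempts to explain @Joran's excellent comment/answer. It is mostly an exercise for me and my own understanding, but hopefully it helps others too. (I'm happy to have my understanding corrected.)

The second half shows some other non-base solutions that can be used for more complex situations.

Joran's answer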

c('not ok','ok')[(is.numeric(df[[1]]) & (df[[2]] != 'b')) + 1]

From ?data.frame

A data frame is a list of variables

So each column/variable in a data.frame is an element of that list.

From ?`[` and this question about the difference between [ and [[, we note that

For lists, one generally uses [[ to select any single element, whereas [ returns a list of the selected elements.

So this solution uses [[ to select a single element of the list

df[[1]]    ## select the 1st column as a single element (which is a vector)
# [1] 0 1 0 1
df[[2]]    ## select the 2nd column as a single element (which is a vector)
# [1] a b c d 

## note that df[1] would return the first column as a data.frame (which is a list), not a vector
## we can see that by 
# > str(df[1])
# 'data.frame': 4 obs. of  1 variable:
#   $ a: num  0 1 0 1
# > str(df[[1]])
# num [1:4] 0 1 0 1

With these two columns selected as vectors, we can perform a vectorised logical check on their elements

is.numeric(df[[1]]) & (df[[2]] != 'b')
# [1]  TRUE FALSE  TRUE  TRUE
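One detail worth spelling out: is.numeric() answers for the whole column, so it returns a single value, and & recycles that value against the element-wise comparison. A small sketch, assuming the df from the question (the variable names are only for illustration):

```r
df <- data.frame(
  a = c(0, 1, 0, 1),
  b = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

col_is_numeric <- is.numeric(df[[1]])  # one value for the whole column
not_b <- df[[2]] != "b"                # one value per row

length(col_is_numeric)
length(not_b)

# & recycles the single value across all four rows
result <- col_is_numeric & not_b
result
```

So the numeric test here is really a per-column check, while only the comparison with "b" varies by row.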

From ?logical we have

...with TRUE being mapped to 1L, FALSE to 0L...

So essentially TRUE == 1L and FALSE == 0L, which we can see with

sum(c(TRUE, TRUE, FALSE, TRUE))
# [1] 3

Now, taking our vector of choices

c("not ok", "ok")
# [1] "not ok" "ok"

we can again use [ to select each element

c("not ok", "ok")[1]
# [1] "not ok"
c("not ok", "ok")[2]
# [1] "ok"
c("not ok", "ok")[3]
# [1] NA
## Because there isn't a 3rd element
c("not ok", "ok")[0]
# character(0)    ## empty
## and we can use a vector to select each element
c("not ok", "ok")[c(1,2,1,3)]
# [1] "not ok" "ok"     "not ok" NA

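This also means we can use the earlier logical comparison to subset the choices. However, because FALSE maps to 0L (and index 0 selects nothing), we need to add 1 to it so that it selects from the vector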

c(TRUE, TRUE, FALSE, TRUE) + 1
# [1] 2 2 1 2

which gives

c("not ok", "ok")[c(2,2,1,2)]
# [1] "ok"     "ok"     "not ok" "ok"

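Applying the same steps to the actual comparison gives us the values we want to include in the original data.frame

df$c <- c("not ok", "ok")[(is.numeric(df[[1]]) & (df[[2]] != "b")) + 1]
df
#   a b      c
# 1 0 a     ok
# 2 1 b not ok
# 3 0 c     ok
# 4 1 d     ok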
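Since the question mentions wanting to avoid nested ifelse calls, here is a hedged sketch of how the same indexing idea could stretch to more than two outcomes. The two checks and the "partly ok" label are invented for illustration and are not part of the original problem.

```r
df <- data.frame(
  a = c(0, 1, 0, 1),
  b = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Hypothetical labels, ordered by how many checks a row passes (0, 1 or 2)
labels <- c("not ok", "partly ok", "ok")

# Each comparison yields TRUE/FALSE per row; adding them counts the passes
passed <- (df$a == 0) + (df$b != "b")

# Shift by 1 because R indexing starts at 1
res <- labels[passed + 1]
res
```

For the example data this should return "ok" "not ok" "ok" "partly ok". It only works when the outcomes can be ranked by a count or score; unrelated branches still need explicit conditions.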

Non-base solutions

## a dplyr version, still using ifelse construct
library(dplyr)
df %>%
  mutate(c = ifelse(is.numeric(a) & b != "b", "ok", "not ok")) 

## a couple of data.table versions using by-reference updates (:=)
library(data.table)
## using an ifelse
setDT(df)[, c := ifelse(is.numeric(a) & b != "b", "ok", "not ok")]

## using filters in i (assumes df has no column c yet, so unmatched rows start as NA)
setDT(df)[is.numeric(a) & b != "b", c := "ok"][is.na(c), c := "not ok"]
