R: What is an efficient way to recode variables? How do I prorate means?

Question

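I was wondering if anyone could point me in the right direction on how to recode multiple variables using the same rules. I have the following df, bhs1: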

structure(list(bhs1_1 = c(NA, 1, NA, 2, 1, 2), bhs1_2 = c(NA, 
2, NA, 2, 1, 1), bhs1_3 = c(NA, 1, NA, 2, 2, 2), bhs1_4 = c(NA, 
2, NA, 1, 1, 1), bhs1_5 = c(NA, 1, NA, 1, 2, 2), bhs1_6 = c(NA, 
1, NA, 2, 1, 2), bhs1_7 = c(NA, 1, NA, 1, 2, 1), bhs1_8 = c(NA, 
2, NA, 2, 2, 2), bhs1_9 = c(NA, 1, NA, 2, 1, 1), bhs1_10 = c(NA, 
2, NA, 1, 2, 2), bhs1_11 = c(NA, 2, NA, 2, 2, 1), bhs1_12 = c(NA, 
2, NA, 2, 1, 1), bhs1_13 = c(NA, 1, NA, 1, 2, 2), bhs1_14 = c(NA, 
2, NA, 2, 1, 1), bhs1_15 = c(NA, 1, NA, 2, 2, 2), bhs1_16 = c(NA, 
2, NA, 2, 2, 2), bhs1_17 = c(NA, 2, NA, 2, 2, 1), bhs1_18 = c(NA, 
1, NA, 1, 2, 1), bhs1_19 = c(NA, 1, NA, 2, 1, 2), bhs1_20 = c(NA, 
2, NA, 2, 1, 1)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

There are two transformation rules, each applying to roughly half of the columns:

(bhs1_2, bhs1_4, bhs1_7, bhs1_9, bhs1_11, bhs1_12, bhs1_14, bhs1_16, bhs1_17, 
bhs1_18, bhs1_20) 
(if_else(1, 1, 0))

and 

(bhs1_1, bhs1_3, bhs1_5, bhs1_6, bhs1_8, bhs1_10, bhs1_13, 
bhs1_15, bhs1_19)
(if_else(2, 1, 0))

Is there an elegant way to write code for this use case? If so, could someone point me in the right direction and/or provide a sample?

Answer 1

Here is a solution using dplyr:

library(dplyr)
case1 <- vars(bhs1_2, bhs1_4, bhs1_7, bhs1_9, bhs1_11, bhs1_12, bhs1_14, bhs1_16, bhs1_17, 
  bhs1_18, bhs1_20) 
case2 <- vars(bhs1_1, bhs1_3, bhs1_5, bhs1_6, bhs1_8, bhs1_10, bhs1_13, 
  bhs1_15, bhs1_19)
result <- df %>%
  mutate_at(case1, ~ (. == 1) * 1L) %>%
  mutate_at(case2, ~ (. == 2) * 1L)

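Note: I skipped the ifelse statement. I simply test your condition and then convert the TRUE/FALSE result to a number by multiplying by 1. I am also not sure how you want NAs handled; this approach leaves them as NA.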
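To make the NA behaviour concrete, here is a minimal base R sketch. The vector `x` is made up for illustration and only mirrors the 1/2/NA coding used in the question.

```r
# Hypothetical example vector (not taken from the question's data frame).
x <- c(NA, 1, 2, 1)

# The comparison returns a logical vector; comparing NA yields NA.
is_one <- x == 1

# Multiplying by 1L coerces TRUE/FALSE to 1/0 and keeps NA as NA.
recoded <- is_one * 1L

print(recoded)
```

If NAs should become 0 instead, they have to be replaced explicitly, for example with `recoded[is.na(recoded)] <- 0L`.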

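If you are not familiar with the pipe operator (%>%), it takes the result of the previous function and passes it as the first argument of the next function. It is meant to make code easier to read by avoiding deeply nested function calls.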
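As a small illustration of the pipe, the sketch below writes the same computation once as nested calls and once as a pipeline. The vector `x` is an arbitrary example, and magrittr is loaded directly here on the assumption that dplyr is not already attached.

```r
library(magrittr)  # provides %>%; dplyr re-exports the same operator

x <- c(1, 4, 9)  # arbitrary example values

# Nested form: read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# Piped form: read left to right; each result becomes the first argument
# of the next call.
piped <- x %>% sqrt() %>% mean() %>% round(1)

print(identical(nested, piped))
```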

Answer 2

We can create vectors with the column names of interest and then convert the logical expression to binary (as.integer):

case1 <- c("bhs1_2", "bhs1_4", "bhs1_7", "bhs1_9", "bhs1_11", "bhs1_12", 
   "bhs1_14", "bhs1_16", "bhs1_17", "bhs1_18", "bhs1_20") 

case2 <-  c("bhs1_1", "bhs1_3", "bhs1_5", "bhs1_6", "bhs1_8", 
   "bhs1_10", "bhs1_13", "bhs1_15", "bhs1_19")
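library(dplyr)     # mutate_at
library(magrittr)  # %<>% (assignment pipe)
# %<>% is used once to assign back to df1; later steps use the ordinary pipe
df1 %<>%
    mutate_at(case1, ~ as.integer(. == 1)) %>%
    mutate_at(case2, ~ as.integer(. == 2))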

df1
# A tibble: 6 x 20
#  bhs1_1 bhs1_2 bhs1_3 bhs1_4 bhs1_5 bhs1_6 bhs1_7 bhs1_8 bhs1_9 bhs1_10
#   <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>   <int>
#1     NA     NA     NA     NA     NA     NA     NA     NA     NA      NA
#2      0      0      0      0      0      0      1      1      1       1
#3     NA     NA     NA     NA     NA     NA     NA     NA     NA      NA
#4      1      0      1      1      0      1      1      1      0       0
#5      0      1      1      1      1      0      0      1      1       1
#6      1      1      1      1      1      1      1      1      1       1
# ... with 10 more variables: bhs1_11 <int>, bhs1_12 <int>, bhs1_13 <int>,
#   bhs1_14 <int>, bhs1_15 <int>, bhs1_16 <int>, bhs1_17 <int>, bhs1_18 <int>,
#   bhs1_19 <int>, bhs1_20 <int>

Or, as an efficient option, use data.table:

library(data.table)
setDT(df1)[, (case1) := lapply(.SD, function(x) as.integer(x == 1 )),
  .SDcols = case1
      ][, (case2) := lapply(.SD, function(x) as.integer(x == 2)), 
  .SDcols = case2][]

Note: this does not assume that all values are the same.

Answer 3

You can do this with a very fast base R approach, as follows:

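# Column groups as listed in the question
case1=c("bhs1_2", "bhs1_4", "bhs1_7", "bhs1_9", "bhs1_11", "bhs1_12", "bhs1_14", "bhs1_16", "bhs1_17", "bhs1_18", "bhs1_20")

case2=c("bhs1_1", "bhs1_3", "bhs1_5", "bhs1_6", "bhs1_8", "bhs1_10", "bhs1_13", "bhs1_15", "bhs1_19")

# dat is the data frame; this relies on the columns holding only 1, 2 or NA
dat[case1]=abs(dat[case1]-2)  # 1 -> 1, 2 -> 0
dat[case2]=dat[case2]-1       # 1 -> 0, 2 -> 1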
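The arithmetic shortcut only gives the intended 0/1 coding when every value is 1, 2 or NA. The following sketch, on a made-up vector that includes a hypothetical out-of-range value of 3, compares it with an explicit comparison:

```r
# Hypothetical vector; 3 stands in for an unexpected value.
x <- c(NA, 1, 2, 3)

# Rule 1 (1 -> 1, otherwise 0)
rule1_arith <- abs(x - 2)          # 3 is mapped to 1, which is not intended
rule1_logic <- as.integer(x == 1)  # 3 is mapped to 0

# Rule 2 (2 -> 1, otherwise 0)
rule2_arith <- x - 1               # 3 is mapped to 2, which is not intended
rule2_logic <- as.integer(x == 2)  # 3 is mapped to 0

print(rbind(rule1_arith, rule1_logic, rule2_arith, rule2_logic))
```

Both forms keep NA as NA, so the choice mainly matters if the data might contain values other than 1 and 2.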

Answer 4

Assuming the OP wants NAs to be converted according to the specified rules as well (so that they become 0), a simple ifelse may help:

case1 = c("bhs1_2", "bhs1_4", "bhs1_7", "bhs1_9", "bhs1_11", "bhs1_12",
          "bhs1_14", "bhs1_16", "bhs1_17", "bhs1_18", "bhs1_20")

case2 = c("bhs1_1", "bhs1_3", "bhs1_5", "bhs1_6", "bhs1_8", "bhs1_10",
          "bhs1_13", "bhs1_15", "bhs1_19")


df[case1] = ifelse(!is.na(df[case1]) & df[case1]==1,1,0)
df[case2] = ifelse(!is.na(df[case2]) & df[case2]==2,1,0)

#Test solution
df[1:7]
#   bhs1_1 bhs1_2 bhs1_3 bhs1_4 bhs1_5 bhs1_6 bhs1_7
# 1      0      0      0      0      0      0      0
# 2      0      0      0      0      0      0      1
# 3      0      0      0      0      0      0      0
# 4      1      0      1      1      0      1      1
# 5      0      1      1      1      1      0      0
# 6      1      1      1      1      1      1      1

**Update:** If NAs should be kept as they are, the solution can be:

df[case1] = ifelse(df[case1]==1,1,0)
df[case2] = ifelse(df[case2]==2,1,0)


df[1:7]
#   bhs1_1 bhs1_2 bhs1_3 bhs1_4 bhs1_5 bhs1_6 bhs1_7
# 1     NA     NA     NA     NA     NA     NA     NA
# 2      0      0      0      0      0      0      1
# 3     NA     NA     NA     NA     NA     NA     NA
# 4      1      0      1      1      0      1      1
# 5      0      1      1      1      1      0      0
# 6      1      1      1      1      1      1      1
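The two variants above differ only in how NA is treated, so they could be folded into one small helper. This is a sketch, not part of the original answer: `recode_equal`, its `keep_na` argument and the `toy` data frame are all hypothetical names introduced for illustration.

```r
# Hypothetical helper: set the columns in `cols` to 1 where they equal
# `target` and to 0 otherwise. With keep_na = FALSE, NA also becomes 0.
recode_equal <- function(data, cols, target, keep_na = TRUE) {
  data[cols] <- lapply(data[cols], function(x) {
    out <- as.integer(x == target)
    if (!keep_na) out[is.na(out)] <- 0L
    out
  })
  data
}

# Toy data with the same 1/2/NA coding as the question.
toy <- data.frame(a = c(NA, 1, 2), b = c(NA, 2, 1))

# Column a follows the "equals 1" rule, column b the "equals 2" rule.
res_na   <- recode_equal(recode_equal(toy, "a", 1), "b", 2)
res_zero <- recode_equal(recode_equal(toy, "a", 1, keep_na = FALSE),
                         "b", 2, keep_na = FALSE)

print(res_na)
print(res_zero)
```

With the question's data, the calls would presumably be `recode_equal(df, case1, 1)` followed by `recode_equal(..., case2, 2)`.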

statistics

r

recode