dplyr: streamline creating matching and absolute difference variables

Question

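I have a dataset on friendships and each person's characteristics, and I am trying to create variables that indicate whether the two people match on binary measures and what their absolute difference is on continuous measures.

I can do this easily enough, but I am wondering whether there is a more streamlined approach than mine, since I have about 60 variables to do this for.

Example data: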

dat <- read.table(text = "id.x  id.y    male.x  smoke.x drink.x everfight.x grades.x    male.y  smoke.y drink.y everfight.y grades.y
1   6   0   2   4   1   3   0   2   1   0   2
2   7   0   2   4   0   5   0   2   3   1   4
3   8   1   4   4   1   2   0   4   2   1   1
4   9   0   2   3   1   2   0   3   2   0   1
5   10  1   2   4   0   4   1   4   1   0   4", header = TRUE)

Here is what I did:

library(dplyr)
dat <- dat %>%
       mutate(sex_match = case_when(male.x == male.y ~ 1,
                                    TRUE ~ 0),
              fight_match = case_when(everfight.x == everfight.y ~ 1,
                                      TRUE ~ 0),
              smoke_diff   = abs(smoke.x  - smoke.y),
              drink_diff   = abs(drink.x  - drink.y),
              grades_diff  = abs(grades.x - grades.y))

This gives exactly what I want:

id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y sex_match fight_match smoke_diff drink_diff grades_diff
 1    6      0       2       4           1        3      0       2       1           0        2         1           0          0          3           1
 2    7      0       2       4           0        5      0       2       3           1        4         1           0          0          1           1
 3    8      1       4       4           1        2      0       4       2           1        1         0           1          0          2           1
 4    9      0       2       3           1        2      0       3       2           0        1         1           0          1          1           1
 5   10      1       2       4           0        4      1       4       1           0        4         1           1          2          3           0

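However, I am wondering whether there is a way, using a loop or an apply function, to identify the corresponding variables and create the new match and absolute-difference variables shown in the example output above.

Update

I ended up using most of Jon's answer and part of akrun's. Here is what worked best for me: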

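library(dplyr); library(tidyr); library(stringr)

# Names of the non-binary (continuous) measures, with the ".x" suffix stripped
non_binary <- dat %>%
  select(ends_with(".x"), -id.x) %>%
  select(where(~ !all(. %in% 0:1))) %>%
  rename_with(~ str_remove(., "\\.x$")) %>%
  names()

dat %>%
  # one row per id pair and variable, with the two versions in columns x and y
  pivot_longer(-c(id.x:id.y),
               names_to = c("var", ".value"),
               names_pattern = "(.+)\\.(.+)") %>%
  # absolute difference for continuous measures, 0/1 match flag for binary ones
  mutate(match = if_else(var %in% non_binary, abs(x - y), 1L * (x == y))) %>%
  mutate(col_name = paste(var, ifelse(var %in% non_binary, "diff", "match"), sep = "_")) %>%
  select(-c(var:y)) %>%
  pivot_wider(names_from = col_name, values_from = match)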

Thank you all!

Answer 1

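We can use the tidyverse with across. This can be done with the dplyr and stringr packages alone: loop across the .x columns for 'male' and 'everfight', then get the values of the corresponding .y column to create the binary column. Do the same for the other columns to get the absolute difference. In .names, use str_replace to replace the column name suffix.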

library(dplyr)
library(stringr)

dat %>%
  mutate(
    # binary measures: 1 if the .x and .y values agree, 0 otherwise
    across(c(male.x, everfight.x),
           ~ +(. == get(str_replace(cur_column(), 'x$', 'y'))),
           .names = "{str_replace(.col, '[.]x$', '_match')}"),
    # continuous measures: absolute difference between .x and .y
    across(c(smoke.x, drink.x, grades.x),
           ~ abs(. - get(str_replace(cur_column(), 'x$', 'y'))),
           .names = "{str_replace(.col, '[.]x$', '_diff')}"))

Output:

id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y male_match everfight_match smoke_diff drink_diff grades_diff
1    1    6      0       2       4           1        3      0       2       1           0        2          1               0          0          3           1
2    2    7      0       2       4           0        5      0       2       3           1        4          1               0          0          1           1
3    3    8      1       4       4           1        2      0       4       2           1        1          0               1          0          2           1
4    4    9      0       2       3           1        2      0       3       2           0        1          1               0          1          1           1
5    5   10      1       2       4           0        4      1       4       1           0        4          1               1          2          3           0

Alternatively, this can be done in a single across call:

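# note: every new column gets the "_diff" suffix here, including the binary ones
dat %>%
  # exclude id.x so that no id_diff column is created
  mutate(across(c(ends_with('.x'), -id.x), ~ {
    other <- get(str_replace(cur_column(), 'x$', 'y'))
    # 0/1 columns get a match flag, all others an absolute difference
    if (all(. %in% c(0, 1))) +(. == other) else abs(. - other)
  }, .names = "{str_replace(.col, '[.]x$', '_diff')}"))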
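
If the "_match" and "_diff" suffixes both matter, one possible variation is to detect the binary pairs first and then reuse the two across calls from the first snippet. This is only a sketch on the example data: the helper names (x_cols, is_binary, binary_x, continuous_x, out) are made up for illustration, and it assumes a pair is binary when both its .x and .y columns contain only 0 and 1.

```r
library(dplyr)
library(stringr)

# Example data from the question
dat <- read.table(text = "id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y
1 6 0 2 4 1 3 0 2 1 0 2
2 7 0 2 4 0 5 0 2 3 1 4
3 8 1 4 4 1 2 0 4 2 1 1
4 9 0 2 3 1 2 0 3 2 0 1
5 10 1 2 4 0 4 1 4 1 0 4", header = TRUE)

# All .x columns except the id (hypothetical helper name)
x_cols <- setdiff(grep("[.]x$", names(dat), value = TRUE), "id.x")

# Assumption: a pair is binary if both the .x and the .y column hold only 0/1
is_binary <- function(col) {
  all(c(dat[[col]], dat[[sub("x$", "y", col)]]) %in% 0:1)
}
binary_x     <- Filter(is_binary, x_cols)
continuous_x <- setdiff(x_cols, binary_x)

out <- dat %>%
  mutate(
    # binary measures: 0/1 match flag
    across(all_of(binary_x),
           ~ +(. == get(str_replace(cur_column(), "x$", "y"))),
           .names = "{str_replace(.col, '[.]x$', '_match')}"),
    # continuous measures: absolute difference
    across(all_of(continuous_x),
           ~ abs(. - get(str_replace(cur_column(), "x$", "y"))),
           .names = "{str_replace(.col, '[.]x$', '_diff')}")
  )
```

With 60 or so variables this avoids typing the column names by hand, at the cost of relying on the 0/1 check to classify each pair correctly.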

Answer 2

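Here is a tidyr/dplyr approach. First, I reshape to long format, with one row per id/variable combination and a column for each version (x and y). Then I can compare every pair in a single step and reshape back to wide.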

library(dplyr); library(tidyr)
non_binary <- c("smoke", "drink", "grades")
dat %>%
  # one row per id pair and variable, with the two versions in columns x and y
  pivot_longer(-c(id.x:id.y),
               names_to = c("var", ".value"),
               names_pattern = "(.+)\\.(.+)") %>%
  # absolute difference for continuous measures, 0/1 match flag for binary ones
  mutate(match = if_else(var %in% non_binary, abs(x - y), 1L * (x == y))) %>%
  mutate(col_name = paste(var, ifelse(var %in% non_binary, "diff", "match"), sep = "_")) %>%
  select(-c(var:y)) %>%
  pivot_wider(names_from = col_name, values_from = match)

The result, which can be joined back onto the original data:

# A tibble: 5 x 7
   id.x  id.y male_match smoke_diff drink_diff everfight_match grades_diff
  <int> <int>      <int>      <int>      <int>           <int>       <int>
1     1     6          1          0          3               0           1
2     2     7          1          0          1               0           1
3     3     8          0          0          2               1           1
4     4     9          1          1          1               0           1
5     5    10          1          2          3               1           0
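
Attaching the result to the original data could look like the sketch below. It assumes the pair id.x/id.y uniquely identifies each row, and the names summary_cols and combined are placeholders chosen for illustration.

```r
library(dplyr)
library(tidyr)

# Example data from the question
dat <- read.table(text = "id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y
1 6 0 2 4 1 3 0 2 1 0 2
2 7 0 2 4 0 5 0 2 3 1 4
3 8 1 4 4 1 2 0 4 2 1 1
4 9 0 2 3 1 2 0 3 2 0 1
5 10 1 2 4 0 4 1 4 1 0 4", header = TRUE)

non_binary <- c("smoke", "drink", "grades")

# Same pipeline as above, stored instead of printed
summary_cols <- dat %>%
  pivot_longer(-c(id.x:id.y),
               names_to = c("var", ".value"),
               names_pattern = "(.+)\\.(.+)") %>%
  mutate(match = if_else(var %in% non_binary, abs(x - y), 1L * (x == y)),
         col_name = paste(var, ifelse(var %in% non_binary, "diff", "match"), sep = "_")) %>%
  select(-c(var:y)) %>%
  pivot_wider(names_from = col_name, values_from = match)

# Assumption: id.x and id.y together identify a row, so they serve as join keys
combined <- left_join(dat, summary_cols, by = c("id.x", "id.y"))
```

Joining on the id columns rather than binding columns by position keeps the rows aligned even if either table has been reordered in the meantime.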
