R categorize row based on dummy variables

Question

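I have a dataset of ATM IDs encoded with dummy variables that tell us whether each ATM was open or closed. The goal is to generate a new column (type) that classifies each ATM by its opening/closure behavior. In this data, a 1 in a dummy variable means the ATM was open, and a 0 means it was closed. Here are the data and the expected output.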

data <- tribble(
  ~atm_id, ~nov_2019,  ~feb_2020, ~may_2020, ~aug_2020, ~nov_2020, ~type,
  "xx1", 0,  0, 0, 0, 0,  "A", 
  "xx2", 0,  1, 1, 1, 1,  "B",
  "xx3", 0, 0, 1, 1, 1, "B", 
  "xx4", 0, 0, 0, 1, 1, "B",
  "xx5", 0, 1, 0, 1, 1, "C",
  "xx6", 0, 1, 0, 1, 0, "C"
)

I am trying to mutate a type variable that classifies each kind of opening/closure behavior.

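Type A - ATMs that are closed in the first time period and remain closed (all zeros).
Type B - ATMs that are closed in the first time period, eventually reopen, and stay open from then on.
Type C - ATMs that are closed in the first time period, eventually reopen, and then close again after reopening, i.e. (0, 1, 0, 1).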

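The month/year columns extend through 2022 and we will add more data later, so ideally the code should adapt flexibly. These three are the basic types of opening/closure behavior, though, and I need to capture them somehow using row-wise operations or another approach.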

Answer 1

You can use c_across together with case_when.

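In the first case, if all values from nov_2019 through nov_2020 are 0, the type is A.
In the second case, if there are exactly two runs of consecutive identical values (counted with data.table::rleid), the type is B.
Otherwise, the type is C. This condition can be replaced with n_distinct(data.table::rleid(c_across(nov_2019:nov_2020))) > 2.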

library(dplyr)
data %>% 
  rowwise() %>% 
  mutate(new = case_when(all(c_across(nov_2019:nov_2020) == 0) ~ "A",
                         n_distinct(data.table::rleid(c_across(nov_2019:nov_2020))) == 2 ~ "B",
                         TRUE ~ "C"))

# A tibble: 6 x 8
# Rowwise: 
  atm_id nov_2019 feb_2020 may_2020 aug_2020 nov_2020 type  new  
  <chr>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl> <chr> <chr>
1 xx1           0        0        0        0        0 A     A    
2 xx2           0        1        1        1        1 B     B    
3 xx3           0        0        1        1        1 B     B    
4 xx4           0        0        0        1        1 B     B    
5 xx5           0        1        0        1        1 C     C    
6 xx6           0        1        0        1        0 C     C
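
If you would rather not depend on data.table, the same run-counting idea can be sketched with base R's rle(). This is only an illustration, not part of the answer above: count_runs and classify_by_runs are hypothetical helper names, and the sketch assumes, as in the question, that every ATM starts closed in the first period.

```r
# Number of runs of identical consecutive values, using base R's rle().
# For a single vector this equals n_distinct(data.table::rleid(x)).
count_runs <- function(x) length(rle(x)$lengths)

# Hypothetical helper: same three rules as the case_when() above.
# Assumes the first value is always 0 (closed).
classify_by_runs <- function(x) {
  if (all(x == 0)) {
    "A"                       # never opened
  } else if (count_runs(x) == 2) {
    "B"                       # one run of 0s followed by one run of 1s
  } else {
    "C"                       # more than two runs: closed again after reopening
  }
}

classify_by_runs(c(0, 0, 0, 0, 0))
#> [1] "A"
classify_by_runs(c(0, 0, 1, 1, 1))
#> [1] "B"
classify_by_runs(c(0, 1, 0, 1, 1))
#> [1] "C"
```

Inside rowwise(), this could be called as classify_by_runs(c_across(nov_2019:nov_2020)).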

Answer 2

I would approach this by first creating a function that encodes the classification rules you want to apply to a single vector.

library(dplyr, warn.conflicts = FALSE)

classify_atm <- function(is_open) {
  is_open <- as.logical(is_open)
  case_when(
    first(is_open) ~ NA_character_, # Not specified
    # Remained closed
    all(!is_open) ~ "A",
    # Reopened without any further closing
    all(is_open == cummax(is_open)) ~ "B",
    # Reopened, but closed again at some point -- essentially, others
    TRUE ~ "C",
  )  
}

# Test on some input vectors
classify_atm(c(0, 0, 0))
#> [1] "A"
classify_atm(c(0, 1, 1))
#> [1] "B"
classify_atm(c(0, 1, 0))
#> [1] "C"
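
The "B" rule relies on cummax(), which may not be obvious at first glance. As a small illustration (the input vector is just an example): the running maximum switches to 1 the first time the ATM opens and stays there, so a vector equals its own running maximum only if it never drops back to 0 after opening.

```r
# Example row: closed, open, closed again, open, open
is_open <- as.logical(c(0, 1, 0, 1, 1))

# Running maximum: becomes 1 at the first opening and never goes back down
cummax(is_open)
#> [1] 0 1 1 1 1

# The third period differs from the running maximum, so this row is not "B"
is_open == cummax(is_open)
#> [1]  TRUE  TRUE FALSE  TRUE  TRUE
```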

Then, use rowwise() and c_across() to form the input for each row:

data <- tribble(
  ~atm_id, ~nov_2019,  ~feb_2020, ~may_2020, ~aug_2020, ~nov_2020, ~type,
  "xx1", 0,  0, 0, 0, 0,  "A", 
  "xx2", 0,  1, 1, 1, 1,  "B",
  "xx3", 0, 0, 1, 1, 1, "B", 
  "xx4", 0, 0, 0, 1, 1, "B",
  "xx5", 0, 1, 0, 1, 1, "C",
  "xx6", 0, 1, 0, 1, 0, "C"
)

data %>% 
  rowwise() %>% 
  mutate(
    new_type = classify_atm(c_across(nov_2019:nov_2020))
  )
#> # A tibble: 6 x 8
#> # Rowwise: 
#>   atm_id nov_2019 feb_2020 may_2020 aug_2020 nov_2020 type  new_type
#>   <chr>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl> <chr> <chr>   
#> 1 xx1           0        0        0        0        0 A     A       
#> 2 xx2           0        1        1        1        1 B     B       
#> 3 xx3           0        0        1        1        1 B     B       
#> 4 xx4           0        0        0        1        1 B     B       
#> 5 xx5           0        1        0        1        1 C     C       
#> 6 xx6           0        1        0        1        0 C     C
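
Since the question mentions that more month/year columns will be added later, one possible way to avoid hard-coding nov_2019:nov_2020 is to select the period columns by name pattern. The following is a sketch under assumptions: the atms data and its feb_2021 column are made up for illustration, the pattern assumes every period column is named like mon_yyyy, and the columns must already be in chronological order in the data frame. classify_atm() is repeated so the block runs on its own.

```r
library(dplyr, warn.conflicts = FALSE)

# Same rules as classify_atm() above
classify_atm <- function(is_open) {
  is_open <- as.logical(is_open)
  case_when(
    first(is_open) ~ NA_character_,          # open in first period: not specified
    all(!is_open) ~ "A",                     # remained closed
    all(is_open == cummax(is_open)) ~ "B",   # reopened and stayed open
    TRUE ~ "C"                               # reopened, then closed again
  )
}

# Hypothetical data with one extra period column (feb_2021) added later
atms <- tribble(
  ~atm_id, ~nov_2019, ~feb_2020, ~may_2020, ~aug_2020, ~nov_2020, ~feb_2021,
  "xx1", 0, 0, 0, 0, 0, 0,
  "xx3", 0, 0, 1, 1, 1, 1,
  "xx6", 0, 1, 0, 1, 0, 0
)

# matches() picks up every column named like "nov_2019", however many exist
result <- atms %>%
  rowwise() %>%
  mutate(new_type = classify_atm(c_across(matches("^[a-z]{3}_[0-9]{4}$")))) %>%
  ungroup()

result$new_type
#> [1] "A" "B" "C"
```

The ungroup() at the end drops the rowwise grouping so later operations work on the whole table again.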
