计算特定列范围内的出现次数和 return 因子变量，R

Question

我有这样的数据：

df<-structure(list(levels_incised___1 = c(0, 0, 0, 0, 0, 0), levels_incised___2 = c(1, 
0, 0, 0, 0, 0), levels_incised___3 = c(1, 0, 0, 0, 0, 0), levels_incised___4 = c(1, 
0, 0, 0, 0, 0), levels_incised___5 = c(1, 0, 0, 0, 0, 0), levels_incised___6 = c(1, 
0, 0, 0, 0, 0), levels_incised___7 = c(1, 0, 0, 0, 0, 0), levels_incised___8 = c(1, 
1, 1, 0, 0, 0), levels_incised___9 = c(1, 1, 1, 0, 0, 0), levels_incised___10 = c(1, 
1, 1, 0, 0, 0), levels_incised___11 = c(0, 1, 0, 0, 0, 0), levels_incised___12 = c(0, 
1, 0, 0, 0, 0), levels_incised___13 = c(0, 1, 0, 0, 0, 0), levels_incised___14 = c(0, 
1, 0, 0, 0, 0), levels_incised___15 = c(0, 1, 0, 0, 0, 0), levels_incised___16 = c(0, 
0, 0, 0, 0, 0), levels_incised___17 = c(0, 0, 0, 0, 0, 0), levels_incised___18 = c(0, 
0, 0, 0, 0, 0), levels_incised___19 = c(0, 0, 0, 0, 0, 0), levels_incised___20 = c(0, 
0, 0, 0, 0, 0), levels_incised___21 = c(0, 0, 0, 0, 0, 0), levels_incised___22 = c(0, 
0, 0, 0, 1, 0), levels_incised___23 = c(0, 0, 0, 0, 1, 1), levels_incised___24 = c(0, 
0, 0, 0, 1, 1), levels_incised___25 = c(0, 0, 0, 0, 1, 1), levels_incised___26 = c(0, 
0, 0, 0, 1, 1), levels_incised___27 = c(0, 0, 0, 1, 1, 1), levels_incised___28 = c(0, 
0, 0, 1, 1, 1), levels_incised___29 = c(0, 0, 0, 1, 1, 0), levels_incised___30 = c(0, 
0, 0, 1, 1, 0), levels_incised___31 = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

最初来自这个 Redcap 输入，其中每个按钮都是这些列之一：

我需要在末尾创建一个列（我们称之为级别），其中包含这些可能的输入：

颈椎（任何 c 按钮）
胸部（t's）
腰椎（L's）
骶骨（骶骨）
胸颈（t 或 c）
胸腰椎（t 或 l）
腰骶（l's 和骶骨）

例如，第一行的患者在 levels_incised_2 到 levels_incised_10 中有“1”...这意味着他们在颈椎范围和胸椎范围内都有值。所以那个病人应该得到“Thoracocervical”。

第 2 行的患者在 8 到 15 之间有 1，所以他们只会得到“胸椎”

有谁知道完成此操作的最直接方法吗？

哦，最后一个细节，还有 100 多个其他专栏，所以如果我可以 select/name 这些特定的专栏来计算就更好了

Answer 1

这里有几点需要解决：

找到一种方法将 levels...# 转换为 C/T/... 类别之一；
生成逻辑以根据组的存在进行推断。

我认为第一个可以通过提取数字并使用 findInterval 来确定 C/T/... 每一列所属。从那里，我们可以做一些简单的 c_across 来找到组中的“任何”，然后 case_when 来获取你的 Level 标签。

library(dplyr)

# helper function for renaming
func <- function(z) {
  num <- as.integer(gsub("\D", "", z))
  grp <- c("C","T","L","S","Co","unclear")[findInterval(num, 1+c(0, 7, 19, 24, 29, 30, 31))]
  grp <- paste0(grp, ave(grp, grp, FUN = seq_along))
  # fix those that do not need numbering
  grp[grepl("^Co", grp)] <- "Co"
  grp[grepl("^unc", grp)] <- "unclear"
  grp
}

out <- df %>%
  rename_with(.cols = starts_with("levels"), .fn = func) %>%
  rowwise() %>%
  mutate(
    anyC = sum(c_across(C1:C7)) > 0, 
    anyT = sum(c_across(T1:T12)) > 0, 
    anyL = sum(c_across(L1:L5)) > 0, 
    anyS = sum(c_across(S1:S5)) > 0
  ) %>%
  ungroup() %>%
  mutate(
    Level = case_when(
      anyC & anyT & anyL ~ "More than 2?", 
      anyL & anyS ~ "Lumbosacral", 
      anyT & anyL ~ "Thoracolumbar", 
      anyT & anyC ~ "Thoracocervical", 
      anyS ~ "Sacral", 
      anyL ~ "Lumbar", 
      anyT ~ "Thoracic", 
      anyC ~ "Cervical", 
      TRUE ~ "Nothing?"
    )
  )
out
# # A tibble: 6 x 36
#      C1    C2    C3    C4    C5    C6    C7    T1    T2    T3    T4    T5    T6    T7    T8    T9   T10   T11   T12    L1    L2    L3    L4    L5    S1    S2    S3    S4    S5    Co unclear anyC  anyT  anyL  anyS  Level          
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <lgl> <lgl> <lgl> <lgl> <chr>          
# 1     0     1     1     1     1     1     1     1     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0       0 TRUE  TRUE  FALSE FALSE Thoracocervical
# 2     0     0     0     0     0     0     0     1     1     1     1     1     1     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0       0 FALSE TRUE  FALSE FALSE Thoracic       
# 3     0     0     0     0     0     0     0     1     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0       0 FALSE TRUE  FALSE FALSE Thoracic       
# 4     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     1     1     1       0 FALSE FALSE FALSE TRUE  Sacral         
# 5     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     1     1     1     1     1     1     1     1       0 FALSE FALSE TRUE  TRUE  Lumbosacral    
# 6     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     1     1     1     1     1     0     0       0 FALSE FALSE TRUE  TRUE  Lumbosacral    
out$Level
# [1] "Thoracocervical" "Thoracic"        "Thoracic"        "Sacral"          "Lumbosacral"     "Lumbosacral"

如果您不想保留重命名，那么您可以将 Level 结果与 cbind(df, Level = out$Level).

合并到您的原始框架中

Answer 2

使用包 dplyr:

## vertebra codes needed later on
vertebra_codes <- c(
    paste0('C',1:7), paste0('T',1:12),
    paste0('L',1:5), paste0('S',1:5),
    'X', ## for Coccyx
    '-' ## for unknown
)


df %>%
    mutate(
        ## assuming each row is a case:
        case_id = paste0('case_',row_number())
    ) %>% 
    ## reshape the data from wide to long format:
    pivot_longer(
        cols = -case_id,
        names_to = 'level_incised', values_to = 'is_incised'
    ) %>%
    mutate(
        ## remove the redundant 'levels_incised__' prefix:
        level_incised = gsub('.*_','',level_incised),
        ## assign the vertebra corresponding to 'level':
        vertebra = vertebra_codes[as.integer(level_incised)],
        ## assign the spine region (e.g.: all lumbal vert. start with 'L'
        spine_region = substr(vertebra,1,1)
    ) %>%
    filter(is_incised == 1) %>% ## we're interested in incised vert. only
    ## remove replicates (more than one vertebra per spine region affected:
    distinct(case_id, spine_region) %>%
    ## do the counts per case:
    group_by(case_id) %>%
    ## string together the affected regions per case:
    summarise(incised_regions = paste(spine_region, collapse = ','))

结果：

# A tibble: 6 x 2
  case_id incised_regions
  <chr>   <chr>          
1 case_1  C,T            
2 case_2  T              
3 case_3  T              
4 case_4  S,X            
5 case_5  L,S,X          
6 case_6  L,S

（请注意，原始的 `df` 在整个处理管道中保持不变。但是您可以通过删除 `%>%` 运算符并检查中间步骤来分解管道，或者将它们分配给临时对象。）

额外/为了好玩： example code 到 ggplot 每个患者的脊椎状态（切开或未切开）。

计算特定列范围内的出现次数和 return 因子变量，R

Count occurrences in specific column ranges and return factor variable, R

r

tidyverse