计算特定列范围内的出现次数和 return 因子变量,R
Count occurrences in specific column ranges and return factor variable, R
我有这样的数据:
df<-structure(list(levels_incised___1 = c(0, 0, 0, 0, 0, 0), levels_incised___2 = c(1,
0, 0, 0, 0, 0), levels_incised___3 = c(1, 0, 0, 0, 0, 0), levels_incised___4 = c(1,
0, 0, 0, 0, 0), levels_incised___5 = c(1, 0, 0, 0, 0, 0), levels_incised___6 = c(1,
0, 0, 0, 0, 0), levels_incised___7 = c(1, 0, 0, 0, 0, 0), levels_incised___8 = c(1,
1, 1, 0, 0, 0), levels_incised___9 = c(1, 1, 1, 0, 0, 0), levels_incised___10 = c(1,
1, 1, 0, 0, 0), levels_incised___11 = c(0, 1, 0, 0, 0, 0), levels_incised___12 = c(0,
1, 0, 0, 0, 0), levels_incised___13 = c(0, 1, 0, 0, 0, 0), levels_incised___14 = c(0,
1, 0, 0, 0, 0), levels_incised___15 = c(0, 1, 0, 0, 0, 0), levels_incised___16 = c(0,
0, 0, 0, 0, 0), levels_incised___17 = c(0, 0, 0, 0, 0, 0), levels_incised___18 = c(0,
0, 0, 0, 0, 0), levels_incised___19 = c(0, 0, 0, 0, 0, 0), levels_incised___20 = c(0,
0, 0, 0, 0, 0), levels_incised___21 = c(0, 0, 0, 0, 0, 0), levels_incised___22 = c(0,
0, 0, 0, 1, 0), levels_incised___23 = c(0, 0, 0, 0, 1, 1), levels_incised___24 = c(0,
0, 0, 0, 1, 1), levels_incised___25 = c(0, 0, 0, 0, 1, 1), levels_incised___26 = c(0,
0, 0, 0, 1, 1), levels_incised___27 = c(0, 0, 0, 1, 1, 1), levels_incised___28 = c(0,
0, 0, 1, 1, 1), levels_incised___29 = c(0, 0, 0, 1, 1, 0), levels_incised___30 = c(0,
0, 0, 1, 1, 0), levels_incised___31 = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
最初来自这个 Redcap 输入,其中每个按钮都是这些列之一:
我需要在末尾创建一个列(我们称之为级别),其中包含这些可能的输入:
- 颈椎(任何 c 按钮)
- 胸部(t's)
- 腰椎(L's)
- 骶骨(骶骨)
- 胸颈(t 或 c)
- 胸腰椎(t 或 l)
- 腰骶(l's 和骶骨)
例如,第一行的患者在 levels_incised_2 到 levels_incised_10 中有“1”...这意味着他们在颈椎范围和胸椎范围内都有值。所以那个病人应该得到“Thoracocervical”。
第 2 行的患者在 8 到 15 之间有 1,所以他们只会得到“胸椎”
有谁知道完成此操作的最直接方法吗?
哦,最后一个细节,还有 100 多个其他专栏,所以如果我可以 select/name 这些特定的专栏来计算就更好了
这里有几点需要解决:
- 找到一种方法将
levels...#
转换为 C
/T
/... 类别之一;
- 生成逻辑以根据组的存在进行推断。
我认为第一个可以通过提取数字并使用 findInterval
来确定 C
/T
/... 每一列所属。从那里,我们可以做一些简单的 c_across
来找到组中的“任何”,然后 case_when
来获取你的 Level
标签。
library(dplyr)
# helper function for renaming
func <- function(z) {
num <- as.integer(gsub("\D", "", z))
grp <- c("C","T","L","S","Co","unclear")[findInterval(num, 1+c(0, 7, 19, 24, 29, 30, 31))]
grp <- paste0(grp, ave(grp, grp, FUN = seq_along))
# fix those that do not need numbering
grp[grepl("^Co", grp)] <- "Co"
grp[grepl("^unc", grp)] <- "unclear"
grp
}
out <- df %>%
rename_with(.cols = starts_with("levels"), .fn = func) %>%
rowwise() %>%
mutate(
anyC = sum(c_across(C1:C7)) > 0,
anyT = sum(c_across(T1:T12)) > 0,
anyL = sum(c_across(L1:L5)) > 0,
anyS = sum(c_across(S1:S5)) > 0
) %>%
ungroup() %>%
mutate(
Level = case_when(
anyC & anyT & anyL ~ "More than 2?",
anyL & anyS ~ "Lumbosacral",
anyT & anyL ~ "Thoracolumbar",
anyT & anyC ~ "Thoracocervical",
anyS ~ "Sacral",
anyL ~ "Lumbar",
anyT ~ "Thoracic",
anyC ~ "Cervical",
TRUE ~ "Nothing?"
)
)
out
# # A tibble: 6 x 36
# C1 C2 C3 C4 C5 C6 C7 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 L1 L2 L3 L4 L5 S1 S2 S3 S4 S5 Co unclear anyC anyT anyL anyS Level
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <chr>
# 1 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 TRUE TRUE FALSE FALSE Thoracocervical
# 2 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FALSE TRUE FALSE FALSE Thoracic
# 3 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FALSE TRUE FALSE FALSE Thoracic
# 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 FALSE FALSE FALSE TRUE Sacral
# 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 FALSE FALSE TRUE TRUE Lumbosacral
# 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 FALSE FALSE TRUE TRUE Lumbosacral
out$Level
# [1] "Thoracocervical" "Thoracic" "Thoracic" "Sacral" "Lumbosacral" "Lumbosacral"
如果您不想保留重命名,那么您可以将 Level
结果与 cbind(df, Level = out$Level)
.
合并到您的原始框架中
使用包 dplyr
:
## vertebra codes needed later on
vertebra_codes <- c(
paste0('C',1:7), paste0('T',1:12),
paste0('L',1:5), paste0('S',1:5),
'X', ## for Coccyx
'-' ## for unknown
)
df %>%
mutate(
## assuming each row is a case:
case_id = paste0('case_',row_number())
) %>%
## reshape the data from wide to long format:
pivot_longer(
cols = -case_id,
names_to = 'level_incised', values_to = 'is_incised'
) %>%
mutate(
## remove the redundant 'levels_incised__' prefix:
level_incised = gsub('.*_','',level_incised),
## assign the vertebra corresponding to 'level':
vertebra = vertebra_codes[as.integer(level_incised)],
## assign the spine region (e.g.: all lumbal vert. start with 'L'
spine_region = substr(vertebra,1,1)
) %>%
filter(is_incised == 1) %>% ## we're interested in incised vert. only
## remove replicates (more than one vertebra per spine region affected:
distinct(case_id, spine_region) %>%
## do the counts per case:
group_by(case_id) %>%
## string together the affected regions per case:
summarise(incised_regions = paste(spine_region, collapse = ','))
结果:
# A tibble: 6 x 2
case_id incised_regions
<chr> <chr>
1 case_1 C,T
2 case_2 T
3 case_3 T
4 case_4 S,X
5 case_5 L,S,X
6 case_6 L,S
(请注意,原始的 `df` 在整个处理管道中保持不变。但是您可以通过删除 `%>%` 运算符并检查中间步骤来分解管道,或者将它们分配给临时对象。)
额外/为了好玩: example code 到 ggplot
每个患者的脊椎状态(切开或未切开)。
我有这样的数据:
df<-structure(list(levels_incised___1 = c(0, 0, 0, 0, 0, 0), levels_incised___2 = c(1,
0, 0, 0, 0, 0), levels_incised___3 = c(1, 0, 0, 0, 0, 0), levels_incised___4 = c(1,
0, 0, 0, 0, 0), levels_incised___5 = c(1, 0, 0, 0, 0, 0), levels_incised___6 = c(1,
0, 0, 0, 0, 0), levels_incised___7 = c(1, 0, 0, 0, 0, 0), levels_incised___8 = c(1,
1, 1, 0, 0, 0), levels_incised___9 = c(1, 1, 1, 0, 0, 0), levels_incised___10 = c(1,
1, 1, 0, 0, 0), levels_incised___11 = c(0, 1, 0, 0, 0, 0), levels_incised___12 = c(0,
1, 0, 0, 0, 0), levels_incised___13 = c(0, 1, 0, 0, 0, 0), levels_incised___14 = c(0,
1, 0, 0, 0, 0), levels_incised___15 = c(0, 1, 0, 0, 0, 0), levels_incised___16 = c(0,
0, 0, 0, 0, 0), levels_incised___17 = c(0, 0, 0, 0, 0, 0), levels_incised___18 = c(0,
0, 0, 0, 0, 0), levels_incised___19 = c(0, 0, 0, 0, 0, 0), levels_incised___20 = c(0,
0, 0, 0, 0, 0), levels_incised___21 = c(0, 0, 0, 0, 0, 0), levels_incised___22 = c(0,
0, 0, 0, 1, 0), levels_incised___23 = c(0, 0, 0, 0, 1, 1), levels_incised___24 = c(0,
0, 0, 0, 1, 1), levels_incised___25 = c(0, 0, 0, 0, 1, 1), levels_incised___26 = c(0,
0, 0, 0, 1, 1), levels_incised___27 = c(0, 0, 0, 1, 1, 1), levels_incised___28 = c(0,
0, 0, 1, 1, 1), levels_incised___29 = c(0, 0, 0, 1, 1, 0), levels_incised___30 = c(0,
0, 0, 1, 1, 0), levels_incised___31 = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
最初来自这个 Redcap 输入,其中每个按钮都是这些列之一:
我需要在末尾创建一个列(我们称之为级别),其中包含这些可能的输入:
- 颈椎(任何 c 按钮)
- 胸部(t's)
- 腰椎(L's)
- 骶骨(骶骨)
- 胸颈(t 或 c)
- 胸腰椎(t 或 l)
- 腰骶(l's 和骶骨)
例如,第一行的患者在 levels_incised_2 到 levels_incised_10 中有“1”...这意味着他们在颈椎范围和胸椎范围内都有值。所以那个病人应该得到“Thoracocervical”。
第 2 行的患者在 8 到 15 之间有 1,所以他们只会得到“胸椎”
有谁知道完成此操作的最直接方法吗?
哦,最后一个细节,还有 100 多个其他专栏,所以如果我可以 select/name 这些特定的专栏来计算就更好了
这里有几点需要解决:
- 找到一种方法将
levels...#
转换为C
/T
/... 类别之一; - 生成逻辑以根据组的存在进行推断。
我认为第一个可以通过提取数字并使用 findInterval
来确定 C
/T
/... 每一列所属。从那里,我们可以做一些简单的 c_across
来找到组中的“任何”,然后 case_when
来获取你的 Level
标签。
library(dplyr)
# helper function for renaming
func <- function(z) {
num <- as.integer(gsub("\D", "", z))
grp <- c("C","T","L","S","Co","unclear")[findInterval(num, 1+c(0, 7, 19, 24, 29, 30, 31))]
grp <- paste0(grp, ave(grp, grp, FUN = seq_along))
# fix those that do not need numbering
grp[grepl("^Co", grp)] <- "Co"
grp[grepl("^unc", grp)] <- "unclear"
grp
}
out <- df %>%
rename_with(.cols = starts_with("levels"), .fn = func) %>%
rowwise() %>%
mutate(
anyC = sum(c_across(C1:C7)) > 0,
anyT = sum(c_across(T1:T12)) > 0,
anyL = sum(c_across(L1:L5)) > 0,
anyS = sum(c_across(S1:S5)) > 0
) %>%
ungroup() %>%
mutate(
Level = case_when(
anyC & anyT & anyL ~ "More than 2?",
anyL & anyS ~ "Lumbosacral",
anyT & anyL ~ "Thoracolumbar",
anyT & anyC ~ "Thoracocervical",
anyS ~ "Sacral",
anyL ~ "Lumbar",
anyT ~ "Thoracic",
anyC ~ "Cervical",
TRUE ~ "Nothing?"
)
)
out
# # A tibble: 6 x 36
# C1 C2 C3 C4 C5 C6 C7 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 L1 L2 L3 L4 L5 S1 S2 S3 S4 S5 Co unclear anyC anyT anyL anyS Level
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <chr>
# 1 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 TRUE TRUE FALSE FALSE Thoracocervical
# 2 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FALSE TRUE FALSE FALSE Thoracic
# 3 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FALSE TRUE FALSE FALSE Thoracic
# 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 FALSE FALSE FALSE TRUE Sacral
# 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 FALSE FALSE TRUE TRUE Lumbosacral
# 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 FALSE FALSE TRUE TRUE Lumbosacral
out$Level
# [1] "Thoracocervical" "Thoracic" "Thoracic" "Sacral" "Lumbosacral" "Lumbosacral"
如果您不想保留重命名,那么您可以将 Level
结果与 cbind(df, Level = out$Level)
.
使用包 dplyr
:
## vertebra codes needed later on
vertebra_codes <- c(
paste0('C',1:7), paste0('T',1:12),
paste0('L',1:5), paste0('S',1:5),
'X', ## for Coccyx
'-' ## for unknown
)
df %>%
mutate(
## assuming each row is a case:
case_id = paste0('case_',row_number())
) %>%
## reshape the data from wide to long format:
pivot_longer(
cols = -case_id,
names_to = 'level_incised', values_to = 'is_incised'
) %>%
mutate(
## remove the redundant 'levels_incised__' prefix:
level_incised = gsub('.*_','',level_incised),
## assign the vertebra corresponding to 'level':
vertebra = vertebra_codes[as.integer(level_incised)],
## assign the spine region (e.g.: all lumbal vert. start with 'L'
spine_region = substr(vertebra,1,1)
) %>%
filter(is_incised == 1) %>% ## we're interested in incised vert. only
## remove replicates (more than one vertebra per spine region affected:
distinct(case_id, spine_region) %>%
## do the counts per case:
group_by(case_id) %>%
## string together the affected regions per case:
summarise(incised_regions = paste(spine_region, collapse = ','))
结果:
# A tibble: 6 x 2
case_id incised_regions
<chr> <chr>
1 case_1 C,T
2 case_2 T
3 case_3 T
4 case_4 S,X
5 case_5 L,S,X
6 case_6 L,S
(请注意,原始的 `df` 在整个处理管道中保持不变。但是您可以通过删除 `%>%` 运算符并检查中间步骤来分解管道,或者将它们分配给临时对象。)
额外/为了好玩: example code 到 ggplot
每个患者的脊椎状态(切开或未切开)。