How to recode numerous factor variables in a tidy manner
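I have a number of variables that are essentially factors, and I want to recode them as integers.
Several variables are strings whose first character is the corresponding integer, e.g.
2 = I have considered suicide in the past week, but not made any plans.
should become 2.
Other variables are yes or no, and should become 1 or 0 respectively.
Others have several levels, each based on a longer string: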
none = 0
one = 1
two = 2
three = 3
four or more = 4
Similarly:
ptsd = 0
depression = 1
generalised anxiety = 2
no diagnosis warranted = 3
And:
Female = 0
Male = 1
Other = 2
Some of the cell values are NA and need to stay NA. I tried the following code on a subset of the variables (starting simple):
vars1 <- vars(pastpsyc, pastmed, hxsuicide)
vars2 <- vars(siss, mssi_1)
df_rc <- df %>%
## this works
mutate_at(vars1, ~ (case_when(
. == "yes" ~ 1,
. == "no" ~ 0
))) %>%
## this does not
mutate_at(vars2, ~as.integer(str_extract(vars2, "[0-9]"))) %>%
## nor does this
mutate_at(diag1, ~ (case_when(
. == "ptsd" ~ 0,
. == "depression" ~ 1,
. == "generalised anxiety" ~ 2,
. == "no diagnosis warranted" ~ 3
)))
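But this fails, and I have no idea how to recode the other variables.
How can I change the different strings to the format I need (preferably in a tidy manner)? A minimal reproducible dataset is below.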
structure(list(siss = c("2 = I have considered suicide in the past week, but not made any plans.",
"1 = I have had vague thoughts of suicide in the past week.",
"2 = I have considered suicide in the past week, but not made any plans.",
"1 = I have had vague thoughts of suicide in the past week.",
"3 = I have made plans to suicide in the past week, but I haven’t intended to act on these plans."
), mssi_1 = c("1. Weak - unsure about whether he/she wants to die, seldom thinks about death, or intensity seems low.",
"1. Weak - unsure about whether he/she wants to die, seldom thinks about death, or intensity seems low.",
"1. Weak - unsure about whether he/she wants to die, seldom thinks about death, or intensity seems low.",
"1. Weak - unsure about whether he/she wants to die, seldom thinks about death, or intensity seems low.",
"2. Moderate - current desire to die, may be preoccupied with ideas about death, or intensity seems greater than a rating of 1."
), diag1 = c("ptsd", NA, "depression", "generalised anxiety",
"no diagnosis warranted"), pastpsyc = c("yes", NA, "no", NA,
"yes"), pastmed = c("no", "yes", NA, "no", "no"), hxsuicide = c("yes",
NA, "yes", "yes", "yes"), suicide_attempts = c("none", NA, "one",
"two", "four or more"), sex = c("Male", "Other", NA, "Female",
NA)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L), spec = structure(list(cols = list(siss = structure(list(), class = c("collector_character",
"collector")), mssi_1 = structure(list(), class = c("collector_character",
"collector")), diag1 = structure(list(), class = c("collector_character",
"collector")), pastpsyc = structure(list(), class = c("collector_character",
"collector")), pastmed = structure(list(), class = c("collector_character",
"collector")), hxsuicide = structure(list(), class = c("collector_character",
"collector")), suicide_attempts = structure(list(), class = c("collector_character",
"collector")), sex = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
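This only needs minor changes.
- In str_extract, use the dot (.) rather than vars2, so that the function receives each column's values instead of the vars() selection object
- Use vars(diag1) or "diag1" (or just use mutate) to change that single column; a bare diag1 is looked up as an object in your environment and is not found
Hope this helps.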
df %>%
## this works
mutate_at(vars1, ~ (case_when(
. == "yes" ~ 1,
. == "no" ~ 0
))) %>%
## fixed: pass . (the column) to str_extract instead of vars2
mutate_at(vars2, ~as.integer(str_extract(., "[0-9]"))) %>%
## fixed: wrap the column name in vars()
mutate_at(vars(diag1), ~ (case_when(
. == "ptsd" ~ 0,
. == "depression" ~ 1,
. == "generalised anxiety" ~ 2,
. == "no diagnosis warranted" ~ 3
)))
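For what it's worth, mutate_at has been superseded by across() since dplyr 1.0.0, so the same pipeline can be sketched as below. This is only an illustration under some assumptions: the small df here is a shortened stand-in for the question's data (same column names, abbreviated strings), and the integer literals (1L, 0L) are used because the question asks for integers, whereas plain 1 and 0 would give doubles.

```r
library(dplyr)
library(stringr)

# Shortened stand-in for the question's data (assumption: same column names,
# abbreviated label text)
df <- tibble(
  siss      = c("2 = considered", "1 = vague thoughts", "3 = made plans"),
  mssi_1    = c("1. Weak", "2. Moderate", NA),
  diag1     = c("ptsd", NA, "depression"),
  pastpsyc  = c("yes", NA, "no"),
  pastmed   = c("no", "yes", NA),
  hxsuicide = c("yes", NA, "yes")
)

df_rc <- df %>%
  mutate(
    # yes/no -> 1/0; values that match no condition (including NA) stay NA
    across(c(pastpsyc, pastmed, hxsuicide),
           ~ case_when(.x == "yes" ~ 1L, .x == "no" ~ 0L)),
    # first digit found in the label -> integer
    across(c(siss, mssi_1), ~ as.integer(str_extract(.x, "[0-9]"))),
    # single column: no across() needed
    diag1 = case_when(
      diag1 == "ptsd" ~ 0L,
      diag1 == "depression" ~ 1L,
      diag1 == "generalised anxiety" ~ 2L,
      diag1 == "no diagnosis warranted" ~ 3L
    )
  )
```

Inside across(), .x plays the same role as the dot in the mutate_at version: it stands for the values of the column currently being transformed.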
If you would rather use mutate than mutate_at:
df %>%
mutate(diag1 = case_when(
diag1 == "ptsd" ~ 0,
diag1 == "depression" ~ 1,
diag1 == "generalised anxiety" ~ 2,
diag1 == "no diagnosis warranted" ~ 3
))
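The remaining variables from the question (suicide_attempts and sex) could be handled with another case_when, but a named lookup vector may be tidier when there are many levels. The following is a minimal sketch, not part of the original answer: the names attempts_codes and sex_codes are made up for illustration, and the codes are copied from the mappings listed in the question.

```r
library(dplyr)

# The two columns from the question's dataset that are not yet recoded
df <- tibble(
  suicide_attempts = c("none", NA, "one", "two", "four or more"),
  sex              = c("Male", "Other", NA, "Female", NA)
)

# Hypothetical lookup tables: label -> integer code
attempts_codes <- c("none" = 0L, "one" = 1L, "two" = 2L,
                    "three" = 3L, "four or more" = 4L)
sex_codes <- c("Female" = 0L, "Male" = 1L, "Other" = 2L)

df_rc <- df %>%
  mutate(
    # Indexing a named vector by a character column maps each label to its
    # code; NA (or a label missing from the table) gives NA.
    # unname() drops the labels that indexing leaves behind as names.
    suicide_attempts = unname(attempts_codes[suicide_attempts]),
    sex              = unname(sex_codes[sex])
  )
```

One thing to watch with this approach is that a misspelled label silently becomes NA, so it may be worth comparing the NA counts before and after recoding.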