How to replace NAs with row means if proportion of row-wise NAs is below a certain threshold?
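Apologies for the slightly cumbersome question, but I am currently working on a mental health study. One of the mental health screening tools has 15 variables, each of which can take a value from 0 to 3. Each row/participant is then assigned a total score by summing these 15 variables. The tool's documentation states that if more than 20% of the values for a given row/participant are missing, the total score should also be treated as missing, but if fewer than 20% of a row's values are missing, each missing value should be assigned the mean of the remaining values for that row.
I decided that to do this I would have to calculate the proportion of NAs for each participant, calculate the mean of all 15 variables for each participant excluding NAs, and then use a conditional mutate statement (or something similar) that checks whether the proportion of NAs is less than 20% and, if so, replaces the NAs in the relevant columns with the row mean, before finally finding the sum of all 15 variables for each row. The data set also contains other columns besides these 15, so applying a function to all columns would not be useful.
To calculate the mean score without NAs, I did the following: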
mental$somatic_mean <- rowMeans(mental [, c("var1", "var2", "var3",
"var4", "var5", "var6", "var7", "var8", "var9", "var10", "var11",
"var12","var13", "var14", "var15")], na.rm=TRUE)
And to calculate the proportion of NAs for each participant:
mental$somatic_na <- rowMeans(is.na(mental [, c("var1", "var2",
"var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10", "var11",
"var12", "var13", "var14", "var15")]))
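However, when I try to use a mutate() statement to change only the rows where fewer than 20% of the values are NA, I can't find any code that works. So far I have tried many permutations, including the following for each variable: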
mental_recode <- mental %>%
rowwise() %>%
mutate(var1 = if(somatic_na<0.2)
replace_na(list(var1= somatic_mean)))
Which returns:
"no applicable method for 'replace_na' applied to an object of class "list""
And trying to do them all together without using mutate():
mental %>%
rowwise() %>%
if(somatic_na<0.2)
replace_na(list(var1 = somatic_mean, var2=
somatic_mean, var3 = somatic_mean, var4 = somatic_mean, var5 =
somatic_mean, var6 = somatic_mean, var7 = somatic_mean, var8 =
somatic_mean, var9 = somatic_mean, var10 = somatic_mean, var11 =
somatic_mean, var12 = somatic_mean, var13 = somatic_mean, var14 =
somatic_mean, var15 = somatic_mean ))
Which returns:
Error in if (.) somatic_na < 0.2 else replace_na(mental, list(var1 = somatic_mean, :
argument is not interpretable as logical
In addition: Warning message:
In if (.) somatic_na < 0.2 else replace_na(mental, list(var1 = somatic_mean, :
the condition has length > 1 and only the first element will be used
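I also tried using if_else() together with mutate(), setting the value to NA when the condition is not met, but could not get that to work either after various permutations and error messages.
EDIT: Dummy data can be generated as follows: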
mental <- structure(list(id = 1:21, var1 = c(0L, 0L, 1L, 1L, 1L, 0L, 0L,
NA, 0L, 0L, 0L, 0L, 0L, 0L, NA, 0L, 0L, 0L,
0L, 0L, 0L), var2 = c(0L,
0L, 1L, 1L, 1L, 0L, 0L, 2L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L,
2L, 0L, 1L, 1L), var3 = c(0L, 0L, 0L, 1L, 1L, 0L, 1L, 2L, 1L,
1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 2L, 0L, 1L, 1L), var4 = c(1L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, NA, 0L, 0L, 0L,
0L, 1L, 0L, 0L), var5 = c(0L, 0L, 0L, 1L, NA, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), var6 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), var7 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, NA, 0L), var8 = c(0L,
0L, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), var9 = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), var10 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, NA, 0L, 0L, 0L,
0L, 0L, NA, 0L), var11 = c(1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, NA, 0L), var12 = c(1L,
0L, 1L, 1L, NA, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
1L, 0L, 1L, 1L), var13 = c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L,
0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, NA, 0L), var14 = c(1L,
0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L,
2L, 0L, 1L, 0L), var15 = c(1L, 0L, 2L, NA, NA, 0L, NA, 0L, 0L,
0L, 0L, 0L, NA, NA, 0L, NA, NA, NA, NA, NA, 0L)), .Names = c("id",
"var1", "var2", "var3", "var4", "var5", "var6", "var7", "var8",
"var9", "var10", "var11", "var12", "var13", "var14", "var15"), class =
"data.frame", row.names = c(NA,
-21L))
Does anyone know of code that would work for this situation?
Thanks in advance!
Here is a way that uses only base R expressions and relies on the mathematical properties of sums and means:
# generate fake data
set.seed(123)
dat <- data.frame(
ID = 1:10,
matrix(sample(c(0:3, NA), 10 * 15, TRUE), nrow = 10, ncol = 15),
'another_var' = 'foo',
'second_var' = 'bar',
stringsAsFactors = FALSE
)
var_names <- paste0('X', 1:15)
# add number of NAs to data
dat$na_num <- rowSums(is.na(dat[var_names]))
# add row sum
dat$row_sum <- rowSums(dat[var_names], na.rm = TRUE)
# add row mean
dat$row_mean <- rowMeans(dat[var_names], na.rm = TRUE)
# add final sum
dat$final_sum <- dat$row_sum + dat$row_mean * dat$na_num
# recode final sum to be NA if prop > .2
dat$final_sum <- ifelse(rowMeans(is.na(dat[var_names])) > .2,
NA,
dat$final_sum)
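The shortcut for final_sum works because filling each missing item with the mean of the observed items adds row_mean once per missing item, so the imputed total equals row_sum + row_mean * na_num. A quick check on a single row, as a sketch (the vector x is made up for illustration and is not taken from either data set):

```r
# One hypothetical participant: 15 items scored 0-3, two of them missing
x <- c(0, 2, 1, 1, NA, 0, 0, 2, 1, 0, NA, 1, 1, 1, 3)

# Explicit imputation: fill each NA with the mean of the observed items
x_imputed <- x
x_imputed[is.na(x)] <- mean(x, na.rm = TRUE)
explicit_total <- sum(x_imputed)

# Shortcut used above: observed sum + row mean * number of NAs
shortcut_total <- sum(x, na.rm = TRUE) + mean(x, na.rm = TRUE) * sum(is.na(x))

# Compare the two totals (all.equal tolerates floating-point noise)
all.equal(explicit_total, shortcut_total)
```

Both routes give the same total, which is why the imputed item values never have to be written back when only the total score is needed.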
Here is a function that does the same thing. You specify data and then a character vector of the variable names.
total_sum_calculation <- function(data, var_names){
# add number of NAs to data
na_num <- rowSums(is.na(data[var_names]))
# add row sum
row_sum <- rowSums(data[var_names], na.rm = TRUE)
# add row mean
row_mean <- rowMeans(data[var_names], na.rm = TRUE)
# add final sum
final_sum <- row_sum + row_mean * na_num
# recode final sum to be NA if prop > .2
ifelse(rowMeans(is.na(data[var_names])) > .2,
NA,
final_sum)
}
v_names <- paste0('var', 1:15)
total_sum_calculation(data = mental, var_names = v_names)
[1] 6.000000 0.000000 8.000000 7.500000 NA 0.000000 3.214286 9.230769 6.000000 2.000000 1.000000 0.000000 4.285714
[14] NA 5.357143 5.357143 5.357143 9.642857 1.071429 NA 3.000000
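If the imputed item values themselves are needed rather than only the total, the same logic can be applied to the item columns directly. The following is a minimal base R sketch under stated assumptions: impute_row_means, max_prop, and the toy data frame are hypothetical names introduced here for illustration, and the toy data has six items so that no row sits exactly on the threshold.

```r
# Toy data standing in for the real data set (values are illustrative only)
toy <- data.frame(
  id   = 1:3,
  var1 = c(1, NA, NA),
  var2 = c(2, 1, NA),
  var3 = c(0, 1, 3),
  var4 = c(1, 1, 2),
  var5 = c(3, 1, 1),
  var6 = c(0, 1, 0)
)
items <- paste0("var", 1:6)

# Hypothetical helper: fill NAs with the row mean when the row's
# proportion of NAs does not exceed max_prop
impute_row_means <- function(data, var_names, max_prop = 0.2) {
  m <- as.matrix(data[var_names])
  prop_na  <- rowMeans(is.na(m))
  row_mean <- rowMeans(m, na.rm = TRUE)
  # Cells that are missing AND belong to a row within the threshold
  fill <- is.na(m) & (prop_na <= max_prop)
  # row(m)[fill] is the row index of each cell being filled
  m[fill] <- row_mean[row(m)[fill]]
  data[var_names] <- m
  data
}

toy_imputed <- impute_row_means(toy, items)
# Rows that still contain NAs (too many missing items) get an NA total
toy_imputed$total <- rowSums(toy_imputed[items])
```

Here the second participant has one missing item out of six and is imputed, while the third has two missing items out of six and keeps both the NAs and an NA total.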
Here is a way to do it all in one chain with dplyr, using the data frame you provided.
First, create a vector of all the column names of interest:
name_col <- colnames(mental)[2:16]
Now use dplyr:
library(dplyr)
mental %>%
# First create the column of row means
mutate(somatic_mean = rowMeans(.[name_col], na.rm = TRUE)) %>%
# Now calculate the proportion of NAs
mutate(somatic_na = rowMeans(is.na(.[name_col]))) %>%
# Create this column for filtering out later
mutate(somatic_usable = ifelse(somatic_na < 0.2,
"yes", "no")) %>%
# Make the following replacement on a row basis
rowwise() %>%
mutate(across(all_of(name_col), # Designate eligible columns to check for NAs
~ replace(.x,
is.na(.x) & somatic_na < 0.2, # Both conditions need to be met
somatic_mean))) %>% # What we are subbing the NAs with
ungroup() # Now ungroup the 'rowwise' in case you need to modify further
Now, if you only want to select the entries with fewer than 20% NAs, you can pipe the above into the following:
filter(somatic_usable == "yes")
Also note that if you want the condition to be less than or equal to 20%, you need to replace both occurrences of somatic_na < 0.2 with somatic_na <= 0.2.
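With 15 items, the strict and inclusive versions differ only for rows with exactly three missing items, since 3/15 is exactly 20%. Note that the base R answer above sets the total to NA only when the proportion is greater than .2, so it behaves like the inclusive version. The small table below is a sketch that makes the boundary visible; the names n_items, na_count, and cutoffs are introduced here for illustration only.

```r
n_items  <- 15
na_count <- 0:4                  # possible numbers of missing items in a row
prop_na  <- na_count / n_items   # proportion of missing items

cutoffs <- data.frame(
  na_count  = na_count,
  prop_na   = prop_na,
  strict    = prop_na <  0.2,    # "fewer than 20%": up to 2 missing items
  inclusive = prop_na <= 0.2     # "at most 20%": up to 3 missing items
)
cutoffs
```

Which of the two is right depends on how the tool's documentation treats a row with exactly 20% missing, which the wording quoted in the question leaves open.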
Hope this helps!