How to calculate z-scores using the scale() function with NA values

Question

I have a data frame with 98790 obs. of 143 variables. It contains numbers and NAs. I want to compute a z-score for each row. I tried the following:

>df
sample1 sample2 sample3 sample4 sample5 sampl6 sample7 sample8
1:     6.96123  3.021311          NA        NA  7.464205   7.902878  -1.194076   7.771018
2:          NA        NA          NA        NA        NA         NA         NA         NA
3:          NA        NA          NA        NA        NA         NA   2.784635         NA
4:          NA        NA    8.342075        NA  8.464205         NA   6.462707   7.118941
5:          NA  7.243703   10.149430        NA        NA   8.317915         NA         NA

and then:

> res <- t(scale(t(df)))

Will the function above ignore all NAs when it computes the z-scores? If not, how can I calculate z-scores without taking the NAs into account?

Answer 1

You may want to convert to a matrix before transposing, scaling, and re-transposing (data frame -> matrix -> transpose -> scale -> transpose -> data frame).
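A minimal sketch of that conversion chain, assuming every column is numeric. The data frame below is a small hypothetical stand-in for the one in the question, not the real data:

```r
# Hypothetical stand-in for the question's data: 5 rows, 4 numeric columns with NAs
df <- data.frame(
  sample1 = c(6.96, NA, NA, NA, NA),
  sample2 = c(3.02, NA, NA, NA, 7.24),
  sample3 = c(NA, NA, NA, 8.34, 10.15),
  sample4 = c(7.46, NA, 2.78, 8.46, 8.32)
)

m   <- as.matrix(df)      # data frame -> numeric matrix
z   <- t(scale(t(m)))     # transpose, scale each column (= original row), transpose back
res <- as.data.frame(z)   # matrix -> data frame, column names preserved

res
```

Rows with no non-missing values (row 2 here) stay all NA, and a row with a single non-missing value (row 3 here) cannot be given a meaningful z-score either, because its standard deviation is undefined.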

Otherwise, it seems to work fine. Here is an example with some NA values:

set.seed(101)
m <- matrix(rnorm(25),5,5)
m[sample(1:25,size=8)] <- NA
m
##            [,1] [,2]       [,3]       [,4]       [,5]
## [1,] -0.3260365   NA  0.5264481 -0.1933380         NA
## [2,]  0.5524619   NA -0.7948444 -0.8497547  0.7085221
## [3,] -0.6749438   NA  1.4277555  0.0584655 -0.2679805
## [4,]  0.2143595   NA -1.4668197 -0.8176704 -1.4639218
## [5,]         NA   NA -0.2366834         NA  0.7444358
scale(m)
##            [,1] [,2]       [,3]       [,4]       [,5]
## [1,] -0.4885685   NA  0.5628440  0.5661203         NA
## [2,]  1.1159619   NA -0.6077977 -0.8785073  0.7475404
## [3,] -1.1258292   NA  1.3613864  1.1202838 -0.1904198
## [4,]  0.4984359   NA -1.2031558 -0.8078967 -1.3391573
## [5,]         NA   NA -0.1132769         NA  0.7820366
## attr(,"scaled:center")
## [1] -0.05853976         NaN -0.10882877 -0.45057439 -0.06973609
## attr(,"scaled:scale")
## [1] 0.5475112 0.0000000 1.1286908 0.4543848 1.0410918

The documentation (?scale) is also quite explicit about how NA values are handled:

... centering is done by subtracting the column means (omitting ‘NA’s) of ‘x’ from their corresponding columns ...

... the root-mean-square for a (possibly centered) column is defined as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing values and n is the number of non-missing values ...

(The relevant phrases are "omitting 'NA's" and "non-missing values".)
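One way to check this behaviour yourself is to compare scale() with a hand-written z-score that drops NAs explicitly. This is only a sketch: it rebuilds a matrix the same way as the example above (the exact values depend on your R version's random number generator), and the name `manual` is just illustrative:

```r
set.seed(101)
m <- matrix(rnorm(25), 5, 5)
m[sample(1:25, size = 8)] <- NA

# Manual z-score per column: mean and sd are computed on non-missing values only
manual <- apply(m, 2, function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))

# Compare with scale(), ignoring the "scaled:center"/"scaled:scale" attributes
same <- isTRUE(all.equal(manual, scale(m), check.attributes = FALSE))
same
```

With the default center = TRUE and scale = TRUE, the divisor scale() uses on the centered column is the ordinary sample standard deviation of the non-missing values, which is why the two results should agree.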
