How to calculate z-scores using the scale() function with NA values
I have a data frame with 98790 obs. of 143 variables. It contains numeric values and NAs. I want to compute a z-score for each row. I tried the following:
>df
sample1 sample2 sample3 sample4 sample5 sampl6 sample7 sample8
1: 6.96123 3.021311 NA NA 7.464205 7.902878 -1.194076 7.771018
2: NA NA NA NA NA NA NA NA
3: NA NA NA NA NA NA 2.784635 NA
4: NA NA 8.342075 NA 8.464205 NA 6.462707 7.118941
5: NA 7.243703 10.149430 NA NA 8.317915 NA NA
and:
>res <- t(scale(t(df)))
Does the above call ignore all the NAs when computing the z-scores? If not, how can I compute the z-scores without taking the NAs into account?
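You may want to convert to a matrix before transposing/scaling/re-transposing (data frame -> matrix -> transpose -> scale -> transpose -> data frame).
Otherwise, it seems to work fine. Here is an example with some NA values (note that scale() works column by column):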
set.seed(101)
m <- matrix(rnorm(25),5,5)
m[sample(1:25,size=8)] <- NA
m
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.3260365 NA 0.5264481 -0.1933380 NA
## [2,] 0.5524619 NA -0.7948444 -0.8497547 0.7085221
## [3,] -0.6749438 NA 1.4277555 0.0584655 -0.2679805
## [4,] 0.2143595 NA -1.4668197 -0.8176704 -1.4639218
## [5,] NA NA -0.2366834 NA 0.7444358
scale(m)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.4885685 NA 0.5628440 0.5661203 NA
## [2,] 1.1159619 NA -0.6077977 -0.8785073 0.7475404
## [3,] -1.1258292 NA 1.3613864 1.1202838 -0.1904198
## [4,] 0.4984359 NA -1.2031558 -0.8078967 -1.3391573
## [5,] NA NA -0.1132769 NA 0.7820366
## attr(,"scaled:center")
## [1] -0.05853976 NaN -0.10882877 -0.45057439 -0.06973609
## attr(,"scaled:scale")
## [1] 0.5475112 0.0000000 1.1286908 0.4543848 1.0410918
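The example above scales columns. For the row-wise case in the question, the data frame -> matrix -> transpose -> scale -> transpose -> data frame pipeline could be sketched roughly as follows. The small data frame here is a hypothetical stand-in for the real one, so treat this as an illustration rather than a tested solution for the full data set:

```r
# Hypothetical stand-in for the data frame in the question
df <- data.frame(
  sample1 = c(6.96123,  NA, NA),
  sample2 = c(3.021311, NA, NA),
  sample3 = c(7.464205, NA, 2.784635),
  sample4 = c(7.902878, NA, NA)
)

# data frame -> matrix -> transpose -> scale -> transpose -> data frame
m_rows <- as.matrix(df)
res <- as.data.frame(t(scale(t(m_rows))))
res
```

Rows with no non-missing values, or with only one, have no usable spread, so their entries come back as NA or NaN instead of a z-score.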
The documentation (?scale) is also quite explicit about how NA values are handled:
... centering is done by subtracting the
column means (omitting ‘NA’s) of ‘x’ from their corresponding
columns ...
... the root-mean-square for a (possibly centered) column is defined
as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing
values and n is the number of non-missing values ...
(emphasis added)
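As a quick check of the documented behaviour, one can compare scale() against a manual computation that drops the NAs explicitly. This is only a sketch: the vector x holds the first column of the example matrix, copied from the printed (rounded) output, and the names z_manual and z_scale are made up for illustration:

```r
# First column of the example matrix, copied from the printed output
x <- c(-0.3260365, 0.5524619, -0.6749438, 0.2143595, NA)

# Manual z-score: mean and standard deviation over the non-missing values only
z_manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# scale() on a one-column matrix; as.vector() drops the dim and scaled:* attributes
z_scale <- as.vector(scale(matrix(x, ncol = 1)))

# The two results should agree, with NA kept in the same position
all.equal(z_manual, z_scale)
```

For a centered column, the root-mean-square in the quoted passage is the sample standard deviation of the non-missing values, which is why the two computations should agree.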