Calculate column median absolute deviation (MAD) by column for each dataframe in a list of dataframes

Question

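I want to calculate the median absolute deviation (mscore) by column, ignoring the first column of each dataframe in a list of dataframes. The result should then be added to each dataframe as a new row with the row name mscore.

Previously I would do the calculation on each dataframe one at a time, but I am now trying to streamline the process.

Below is a small subset of my list of dataframes. The full list of dfs has over 30 dataframes.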


   list(Al2O3 = structure(list(Determination_No = 1:6, `2` = c(2.01, 
    2.02, 2.03, 2.01, 2.02, 2), `3` = c(2.01, 2.01, 2, 2.02, 2.02, 
    2.03), `4` = c(2, 2.03, 1.99, 2.01, 2.01, 2.01), `5` = c(2.02, 
    2.02, 2.05, 2.03, 2.02, 2.03), `7` = c(1.88, 1.9, 1.89, 1.88, 
    1.88, 1.87), `8` = c(2.053, 2.044, 2.041, 2.038, 2.008, 2.02), 
    `10` = c(2.002830415, 2.021725042, 2.021725042, 1.983935789, 
    2.002830415, 2.021725042), `12` = c(2.09, 2.05, 1.96, 2.09, 
    2.06, 2.02)), class = "data.frame", row.names = c(NA, -6L
    )), As = structure(list(Determination_No = 1:6, `2` = c(0.052, 
    0.027, 0.011, 0.011, 0.012, 0.012), `3` = c(0.012, 0.012, 0.013, 
    0.012, 0.013, 0.013), `4` = c(0.012, 0.012, 0.013, 0.012, 0.012, 
    0.012), `5` = c(0.013, 0.013, 0.013, 0.013, 0.013, 0.013), `7` = c(0.011, 
    0.011, 0.011, 0.012, 0.011, 0.011), `8` = c(0.011, 0.01, 0.011, 
    0.011, 0.011, 0.011), `10` = c(0.01, 0.01, 0.01, 0.01, 0.01, 
    0.01), `12` = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_)), class = "data.frame", row.names = c(NA, -6L)), Fe = structure(list(
    Determination_No = 1:6, `2` = c(55.94, 55.7, 56.59, 56.5, 
    55.98, 55.93), `3` = c(56.83, 56.54, 56.18, 56.5, 56.51, 
    56.34), `4` = c(56.39, 56.43, 56.53, 56.31, 56.47, 56.35), 
    `5` = c(56.32, 56.29, 56.31, 56.32, 56.39, 56.32), `7` = c(56.48, 
    56.4, 56.54, 56.43, 56.73, 56.62), `8` = c(56.382, 56.258, 
    56.442, 56.258, 56.532, 56.264), `10` = c(56.3, 56.5, 56.2, 
    56.5, 56.7, 56.5), `12` = c(56.11, 56.46, 56.1, 56.35, 56.36, 
    56.37)), class = "data.frame", row.names = c(NA, -6L)))

Previously I would do the following

#create a modified scores function to accept NAs

scores_na <- function(x, ...) { 
  not_na <- !is.na(x)
  scores <- rep(NA, length(x))
  scores[not_na] <- outliers::scores(na.omit(x), ...)
  scores
}

MscoreMax <- 3.0 # the threshold above which a value is deemed an outlier and removed

colmedians <- sapply(df[-1], median, na.rm = TRUE) # median of every column except the first
MScore <- as.vector(round(abs(scores_na(colmedians, "mad")), digits = 2)) # Mscore to 2 decimal places
MscoreIndex <- which(MScore > MscoreMax) # get the index of each value exceeding the threshold
df[-1][MscoreIndex] <- NA # change outliers to NA so they are excluded from further calculations

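I have tried using the lines below to calculate the medians

The colMedians function works on matrices, so I used mapply to apply median across the columns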

df <- lapply(df, function(x) rbind(x[,-1], 
                                   mapply(median(x[,-1],na.rm = TRUE))))

But I get the following error

    Error in median.default(x[, -1], na.rm = TRUE) : need numeric data

When I inspect the dataframes, my values are stored as doubles, so I am a bit stuck.

Answer 1

Try the following

lapply(A, function(x) {x[nrow(x) + 1,] <- apply(x, 2, function(y) median(y, na.rm = TRUE)); 
                  x <- x[,-1]; 
                  rownames(x)[nrow(x)] <- "mscores"; 
                          return(x)})

Then we get the result

    > lapply(A, function(x) {x[nrow(x) + 1,] <- apply(x, 2, function(y) median(y, na.rm = TRUE)); x <- x[,-1]; rownames(x)[nrow(x)] <- "mscores"; return(x)})
$Al2O3
            2     3    4     5    7      8       10    12
1       2.010 2.010 2.00 2.020 1.88 2.0530 2.002830 2.090
2       2.020 2.010 2.03 2.020 1.90 2.0440 2.021725 2.050
3       2.030 2.000 1.99 2.050 1.89 2.0410 2.021725 1.960
4       2.010 2.020 2.01 2.030 1.88 2.0380 1.983936 2.090
5       2.020 2.020 2.01 2.020 1.88 2.0080 2.002830 2.060
6       2.000 2.030 2.01 2.030 1.87 2.0200 2.021725 2.020
mscores 2.015 2.015 2.01 2.025 1.88 2.0395 2.012278 2.055

$As
            2      3     4     5     7     8   10 12
1       0.052 0.0120 0.012 0.013 0.011 0.011 0.01 NA
2       0.027 0.0120 0.012 0.013 0.011 0.010 0.01 NA
3       0.011 0.0130 0.013 0.013 0.011 0.011 0.01 NA
4       0.011 0.0120 0.012 0.013 0.012 0.011 0.01 NA
5       0.012 0.0130 0.012 0.013 0.011 0.011 0.01 NA
6       0.012 0.0130 0.012 0.013 0.011 0.011 0.01 NA
mscores 0.012 0.0125 0.012 0.013 0.011 0.011 0.01 NA

$Fe
            2      3     4     5     7      8   10     12
1       55.94 56.830 56.39 56.32 56.48 56.382 56.3 56.110
2       55.70 56.540 56.43 56.29 56.40 56.258 56.5 56.460
3       56.59 56.180 56.53 56.31 56.54 56.442 56.2 56.100
4       56.50 56.500 56.31 56.32 56.43 56.258 56.5 56.350
5       55.98 56.510 56.47 56.39 56.73 56.532 56.7 56.360
6       55.93 56.340 56.35 56.32 56.62 56.264 56.5 56.370
mscores 55.96 56.505 56.41 56.32 56.51 56.323 56.5 56.355
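
Note that the rows appended above are column medians. If the median absolute deviation from the title is what is wanted, the same pattern should work with stats::mad() in place of median(). The following is only a sketch on a small subset of the question's data; the names dfs and with_mad are made up for illustration, and mad() is used with its default scaling constant (1.4826), which may or may not be what you need.

```r
# Small subset of the question's data (two elements, two measurement columns each)
dfs <- list(
  Al2O3 = data.frame(Determination_No = 1:6,
                     `2` = c(2.01, 2.02, 2.03, 2.01, 2.02, 2),
                     `3` = c(2.01, 2.01, 2, 2.02, 2.02, 2.03),
                     check.names = FALSE),
  As = data.frame(Determination_No = 1:6,
                  `2` = c(0.052, 0.027, 0.011, 0.011, 0.012, 0.012),
                  `12` = rep(NA_real_, 6),
                  check.names = FALSE)
)

# For each data frame: drop the first column, compute the MAD of every
# remaining column, and append it as a row named "mscore".
with_mad <- lapply(dfs, function(x) {
  vals <- x[-1]
  rbind(vals, mscore = sapply(vals, mad, na.rm = TRUE))
})

with_mad$Al2O3
```

An all-NA column (such as `12` in As) simply gets NA in the mscore row.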

Answer 2

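By far the canonical way to apply a function to all or some of the columns of a data.frame is lapply.

I don't think mapply is useful the way you are using it there: its first argument needs to be a function, unevaluated, not an already-evaluated expression as in median(.). Still, I think we can use lapply instead.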

FYI, these two calls are equivalent uses of lapply and mapply (though mapply simplifies its result to a vector by default):
lapply(list(1:2, 3:4), sum)
mapply(sum, list(1:2, 3:4))
However, only mapply can do this directly (lapply can do it using indices instead of raw data):
mapply(function(x, y) sum(x) + y, list(1:2, 3:4), list(5, 6))
which "unrolls" into
sum(1:2) + 5
sum(3:4) + 6
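
To make the point about the error concrete, here is a small sketch using a few rows from the question's Al2O3 data. The names df, msg and col_medians are just for illustration. median() refuses a whole data frame even when every column is numeric, which is where "need numeric data" comes from; handing the function itself to sapply (or lapply) calls it once per column.

```r
df <- data.frame(Determination_No = 1:3,
                 `2` = c(2.01, 2.02, 2.03),
                 `3` = c(2.01, 2.01, 2),
                 check.names = FALSE)

# median() on a data frame fails, regardless of the column types
msg <- tryCatch(median(df[, -1], na.rm = TRUE),
                error = function(e) conditionMessage(e))
msg

# Passing the function unevaluated lets sapply() apply it to each column
col_medians <- sapply(df[-1], median, na.rm = TRUE)
col_medians
```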

My first step is to aggregate the data, which is fairly simple.

tmp <- lapply(z, function(x) as.data.frame(lapply(x[,-1], median, na.rm=TRUE), check.names = FALSE))
tmp
# $Al2O3
#       2     3    4     5    7      8       10    12
# 1 2.015 2.015 2.01 2.025 1.88 2.0395 2.012278 2.055
# $As
#       2      3     4     5     7     8   10 12
# 1 0.012 0.0125 0.012 0.013 0.011 0.011 0.01 NA
# $Fe
#       2      3     4     5     7      8   10     12
# 1 55.96 56.505 56.41 56.32 56.51 56.323 56.5 56.355
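
These medians can then feed the outlier check from the question. Below is a rough base-R sketch for the Fe medians printed above. mad_score is a hypothetical helper, not part of any package; it computes (x - median) / mad, which I believe is what outliers::scores(x, "mad") does, but treat that equivalence as an assumption and check it against the package if it matters.

```r
# Column medians for Fe, copied from the tmp output above
fe_medians <- c(`2` = 55.96, `3` = 56.505, `4` = 56.41, `5` = 56.32,
                `7` = 56.51, `8` = 56.323, `10` = 56.5, `12` = 56.355)

# Hypothetical helper: MAD-based score that tolerates NAs
mad_score <- function(x) {
  (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
}

MscoreMax <- 3.0  # threshold from the question
MScore <- round(abs(mad_score(fe_medians)), digits = 2)

# which() drops NA scores, so all-NA columns are never flagged
flagged <- names(MScore)[which(MScore > MscoreMax)]
flagged
```

Wrapping the last three lines in a function and calling it through lapply over tmp would cover all 30+ dataframes in one go.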

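I could easily append this to the original data and add a row name as suggested, but I'd like to briefly discuss a few things first:

if you are going to do more analysis/calculation on the data, I don't think storing summary statistics inside the data itself is a good idea;
there is no clear guidance on what to do with Determination_No;
row names can be fragile: many R functions use and preserve them just fine, but some do not ... and most functions from dplyr and related packages (if you use them) make no effort to preserve them (and occasionally erase them deliberately).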
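
One way to act on those concerns, sketched here under my own assumptions: keep the summaries in a separate table and identify each one with an ordinary column instead of a row name. The column name element and the object summary_df are made up for illustration, and tmp is rebuilt with only two columns to keep the example short.

```r
# Two of the per-element median summaries, trimmed to two columns
tmp <- list(
  Al2O3 = data.frame(`2` = 2.015, `3` = 2.015, check.names = FALSE),
  Fe    = data.frame(`2` = 55.96, `3` = 56.505, check.names = FALSE)
)

# Put the list name into a regular column, then stack the one-row frames
summary_df <- do.call(rbind, unname(Map(function(nm, smry) {
  cbind(data.frame(element = nm, stringsAsFactors = FALSE), smry)
}, names(tmp), tmp)))
rownames(summary_df) <- NULL

summary_df
```

Because the identifier lives in a column, it survives joins, filters and dplyr verbs, and the raw measurements stay free of summary rows.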

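Ultimately, I suspect you want this added as the last row for presentation purposes, in which case it is best (in my mind) done as part of the rendering process. For example,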

Map(function(nm, dat, smry) {
  dat$Determination_No <- as.character(dat$Determination_No)
  rbind(dat, cbind(data.frame(Determination_No = paste(nm, "median -->")), smry))
}, names(z), z, tmp)
# $Al2O3
#   Determination_No     2     3    4     5    7      8       10    12
# 1                1 2.010 2.010 2.00 2.020 1.88 2.0530 2.002830 2.090
# 2                2 2.020 2.010 2.03 2.020 1.90 2.0440 2.021725 2.050
# 3                3 2.030 2.000 1.99 2.050 1.89 2.0410 2.021725 1.960
# 4                4 2.010 2.020 2.01 2.030 1.88 2.0380 1.983936 2.090
# 5                5 2.020 2.020 2.01 2.020 1.88 2.0080 2.002830 2.060
# 6                6 2.000 2.030 2.01 2.030 1.87 2.0200 2.021725 2.020
# 7 Al2O3 median --> 2.015 2.015 2.01 2.025 1.88 2.0395 2.012278 2.055
# $As
#   Determination_No     2      3     4     5     7     8   10 12
# 1                1 0.052 0.0120 0.012 0.013 0.011 0.011 0.01 NA
# 2                2 0.027 0.0120 0.012 0.013 0.011 0.010 0.01 NA
# 3                3 0.011 0.0130 0.013 0.013 0.011 0.011 0.01 NA
# 4                4 0.011 0.0120 0.012 0.013 0.012 0.011 0.01 NA
# 5                5 0.012 0.0130 0.012 0.013 0.011 0.011 0.01 NA
# 6                6 0.012 0.0130 0.012 0.013 0.011 0.011 0.01 NA
# 7    As median --> 0.012 0.0125 0.012 0.013 0.011 0.011 0.01 NA
# $Fe
#   Determination_No     2      3     4     5     7      8   10     12
# 1                1 55.94 56.830 56.39 56.32 56.48 56.382 56.3 56.110
# 2                2 55.70 56.540 56.43 56.29 56.40 56.258 56.5 56.460
# 3                3 56.59 56.180 56.53 56.31 56.54 56.442 56.2 56.100
# 4                4 56.50 56.500 56.31 56.32 56.43 56.258 56.5 56.350
# 5                5 55.98 56.510 56.47 56.39 56.73 56.532 56.7 56.360
# 6                6 55.93 56.340 56.35 56.32 56.62 56.264 56.5 56.370
# 7    Fe median --> 55.96 56.505 56.41 56.32 56.51 56.323 56.5 56.355

A lot more can be done with table-rendering tools such as knitr::kable, the gt package, DT (if in an HTML environment), etc.
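
As a rough illustration of the rendering step, this sketch builds the Fe table with its median row (two columns only) and hands it to knitr::kable. It assumes knitr is installed, hence the requireNamespace guard; the object names fe, smry and shown and the caption text are made up for the example.

```r
fe <- data.frame(Determination_No = as.character(1:6),
                 `2` = c(55.94, 55.7, 56.59, 56.5, 55.98, 55.93),
                 `3` = c(56.83, 56.54, 56.18, 56.5, 56.51, 56.34),
                 check.names = FALSE, stringsAsFactors = FALSE)

# One-row summary of column medians, as in tmp
smry <- as.data.frame(lapply(fe[-1], median, na.rm = TRUE), check.names = FALSE)

# Append the summary only in the copy that gets displayed
shown <- rbind(fe, cbind(data.frame(Determination_No = "Fe median -->",
                                    stringsAsFactors = FALSE), smry))

# Render only if knitr is available
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(shown, digits = 3, caption = "Fe with column medians"))
}
```

The original fe stays untouched, so later calculations never see the summary row.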

r

lapply

dataframe