Calculate the column-wise median absolute deviation (MAD) for each data frame in a list of data frames
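I want to calculate the median absolute deviation (mscore) by column, ignoring the first column of each data frame in a list of data frames. The result should then be added to each data frame as a new row with the row name mscore.
Previously I ran the calculation on one data frame at a time, but now I would like to streamline the process.
Below is a small part of my list of data frames. The full list of dfs has more than 30 data frames.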
list(Al2O3 = structure(list(Determination_No = 1:6, `2` = c(2.01,
2.02, 2.03, 2.01, 2.02, 2), `3` = c(2.01, 2.01, 2, 2.02, 2.02,
2.03), `4` = c(2, 2.03, 1.99, 2.01, 2.01, 2.01), `5` = c(2.02,
2.02, 2.05, 2.03, 2.02, 2.03), `7` = c(1.88, 1.9, 1.89, 1.88,
1.88, 1.87), `8` = c(2.053, 2.044, 2.041, 2.038, 2.008, 2.02),
`10` = c(2.002830415, 2.021725042, 2.021725042, 1.983935789,
2.002830415, 2.021725042), `12` = c(2.09, 2.05, 1.96, 2.09,
2.06, 2.02)), class = "data.frame", row.names = c(NA, -6L
)), As = structure(list(Determination_No = 1:6, `2` = c(0.052,
0.027, 0.011, 0.011, 0.012, 0.012), `3` = c(0.012, 0.012, 0.013,
0.012, 0.013, 0.013), `4` = c(0.012, 0.012, 0.013, 0.012, 0.012,
0.012), `5` = c(0.013, 0.013, 0.013, 0.013, 0.013, 0.013), `7` = c(0.011,
0.011, 0.011, 0.012, 0.011, 0.011), `8` = c(0.011, 0.01, 0.011,
0.011, 0.011, 0.011), `10` = c(0.01, 0.01, 0.01, 0.01, 0.01,
0.01), `12` = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_)), class = "data.frame", row.names = c(NA, -6L)), Fe = structure(list(
Determination_No = 1:6, `2` = c(55.94, 55.7, 56.59, 56.5,
55.98, 55.93), `3` = c(56.83, 56.54, 56.18, 56.5, 56.51,
56.34), `4` = c(56.39, 56.43, 56.53, 56.31, 56.47, 56.35),
`5` = c(56.32, 56.29, 56.31, 56.32, 56.39, 56.32), `7` = c(56.48,
56.4, 56.54, 56.43, 56.73, 56.62), `8` = c(56.382, 56.258,
56.442, 56.258, 56.532, 56.264), `10` = c(56.3, 56.5, 56.2,
56.5, 56.7, 56.5), `12` = c(56.11, 56.46, 56.1, 56.35, 56.36,
56.37)), class = "data.frame", row.names = c(NA, -6L)))
Previously I would do the following:
# Create a modified scores function that tolerates NAs
scores_na <- function(x, ...) {
  not_na <- !is.na(x)
  scores <- rep(NA, length(x))
  scores[not_na] <- outliers::scores(na.omit(x), ...)
  scores
}
MscoreMax <- 3.0 # the threshold above which a value is deemed an outlier
colmedians <- mapply(median, df[-1], na.rm = TRUE) # column medians, skipping the first column
MScore <- as.vector(round(abs(scores_na(colmedians, "mad")), digits = 2)) # M-score to 2 decimal places
MscoreIndex <- which(MScore > MscoreMax) # index of each value exceeding the threshold
df[-1][MscoreIndex] <- NA # change outliers to NA so they are excluded from further calculations
I have tried the lines below to calculate the medians.
The colmedians function is meant for matrices, so I used mapply to apply across the columns:
df <- lapply(df, function(x) rbind(x[,-1],
mapply(median(x[,-1],na.rm = TRUE))))
But I get the following error:
Error in median.default(x[, -1], na.rm = TRUE) : need numeric data
When I inspect the data frames, my values are stored as doubles, so I am a bit stuck.
Try the following:
lapply(A, function(x) {x[nrow(x) + 1,] <- apply(x, 2, function(y) median(y, na.rm = TRUE));
x <- x[,-1];
rownames(x)[nrow(x)] <- "mscores";
return(x)})
Then we get the result:
> lapply(A, function(x) {x[nrow(x) + 1,] <- apply(x, 2, function(y) median(y, na.rm = TRUE)); x <- x[,-1]; rownames(x)[nrow(x)] <- "mscores"; return(x)})
$Al2O3
2 3 4 5 7 8 10 12
1 2.010 2.010 2.00 2.020 1.88 2.0530 2.002830 2.090
2 2.020 2.010 2.03 2.020 1.90 2.0440 2.021725 2.050
3 2.030 2.000 1.99 2.050 1.89 2.0410 2.021725 1.960
4 2.010 2.020 2.01 2.030 1.88 2.0380 1.983936 2.090
5 2.020 2.020 2.01 2.020 1.88 2.0080 2.002830 2.060
6 2.000 2.030 2.01 2.030 1.87 2.0200 2.021725 2.020
mscores 2.015 2.015 2.01 2.025 1.88 2.0395 2.012278 2.055
$As
2 3 4 5 7 8 10 12
1 0.052 0.0120 0.012 0.013 0.011 0.011 0.01 NA
2 0.027 0.0120 0.012 0.013 0.011 0.010 0.01 NA
3 0.011 0.0130 0.013 0.013 0.011 0.011 0.01 NA
4 0.011 0.0120 0.012 0.013 0.012 0.011 0.01 NA
5 0.012 0.0130 0.012 0.013 0.011 0.011 0.01 NA
6 0.012 0.0130 0.012 0.013 0.011 0.011 0.01 NA
mscores 0.012 0.0125 0.012 0.013 0.011 0.011 0.01 NA
$Fe
2 3 4 5 7 8 10 12
1 55.94 56.830 56.39 56.32 56.48 56.382 56.3 56.110
2 55.70 56.540 56.43 56.29 56.40 56.258 56.5 56.460
3 56.59 56.180 56.53 56.31 56.54 56.442 56.2 56.100
4 56.50 56.500 56.31 56.32 56.43 56.258 56.5 56.350
5 55.98 56.510 56.47 56.39 56.73 56.532 56.7 56.360
6 55.93 56.340 56.35 56.32 56.62 56.264 56.5 56.370
mscores 55.96 56.505 56.41 56.32 56.51 56.323 56.5 56.355
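By far the canonical way to apply a function to all or some columns of a data.frame is lapply.
I don't think mapply is useful the way you are using it there: its first argument needs to be a function, unevaluated, not an evaluated expression as in median(.). I think we can use lapply instead.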
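To illustrate the point about mapply, here is a minimal sketch with a small made-up data frame (`d` and its column names are assumptions for illustration, not the asker's data). Calling median() on a whole data frame reproduces the "need numeric data" error, while passing the function itself, unevaluated, works:

```r
# Hypothetical toy data; the first column plays the role of Determination_No
d <- data.frame(id = 1:3, a = c(1, 2, 4), b = c(10, NA, 30))

# median() on a data frame fails: it expects a numeric vector, not a list of columns
err <- tryCatch(median(d[, -1], na.rm = TRUE),
                error = function(e) conditionMessage(e))
err

# Pass the function unevaluated; constant arguments go in MoreArgs
m1 <- mapply(median, d[-1], MoreArgs = list(na.rm = TRUE))

# With a single list to iterate over, sapply()/lapply() is the simpler choice
m2 <- sapply(d[-1], median, na.rm = TRUE)
m2
```

Both calls give one median per column; mapply only becomes the better tool when several lists have to be walked in parallel.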
FYI, these two calls to mapply and lapply are roughly equivalent (lapply returns a list, while mapply simplifies the result to a vector by default):
lapply(list(1:2, 3:4), sum)
mapply(sum, list(1:2, 3:4))
However, only mapply can do this directly (lapply can do it using indices instead of raw data):
mapply(function(x, y) sum(x) + y, list(1:2, 3:4), list(5, 6))
which "unrolls" into
sum(1:2) + 5
sum(3:4) + 6
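As a small sketch of the "indices instead of raw data" remark (the names `xs` and `ys` are made up for illustration), the same computation can be written with sapply over positions:

```r
xs <- list(1:2, 3:4)
ys <- list(5, 6)

# mapply walks both lists in parallel
r1 <- mapply(function(x, y) sum(x) + y, xs, ys)

# sapply/lapply can do the same by iterating over positions and indexing both lists
r2 <- sapply(seq_along(xs), function(i) sum(xs[[i]]) + ys[[i]])
```

Here both `r1` and `r2` should hold the two values from the unrolled calls above.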
My first step is to aggregate the data, which is fairly straightforward.
tmp <- lapply(z, function(x) as.data.frame(lapply(x[,-1], median, na.rm=TRUE), check.names = FALSE))
tmp
# $Al2O3
# 2 3 4 5 7 8 10 12
# 1 2.015 2.015 2.01 2.025 1.88 2.0395 2.012278 2.055
# $As
# 2 3 4 5 7 8 10 12
# 1 0.012 0.0125 0.012 0.013 0.011 0.011 0.01 NA
# $Fe
# 2 3 4 5 7 8 10 12
# 1 55.96 56.505 56.41 56.32 56.51 56.323 56.5 56.355
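The question title asks for the MAD rather than the median. Assuming the same list-of-data-frames layout, the same pattern should work with stats::mad swapped in for median. The list `z` below is a small made-up stand-in, not the asker's data. Note that mad() scales its result by the constant 1.4826 by default; passing constant = 1 gives the raw median absolute deviation.

```r
# Small made-up stand-in for the list of data frames
z <- list(
  X = data.frame(Determination_No = 1:5,
                 `2` = c(1, 2, 3, 4, 100),
                 `3` = c(NA, 2, 2, 2, 2),
                 check.names = FALSE)
)

# One-row data frame of column MADs per list element, skipping the first column
mads <- lapply(z, function(x)
  as.data.frame(lapply(x[-1], mad, constant = 1, na.rm = TRUE),
                check.names = FALSE))
mads
```

Whether the scaled or unscaled MAD is wanted depends on how the value is used downstream, so the constant is worth setting explicitly.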
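I could easily append this to the original data and add a row name as suggested, but a few points are worth a brief discussion:
- if you are going to do any further analysis or calculation on the data, I don't think adding summary statistics to the data itself is a good idea;
- there is no clear guidance on what to do with Determination_No;
- row names can be fragile: while many R functions use and preserve them well, some do not, and most functions from dplyr and related packages (if you use them) make no effort to preserve them (and occasionally strip them on purpose).
Ultimately, I suspect you want the summary added as a last row for presentation purposes, so it is best (in my mind) done as part of the rendering process. For example,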
Map(function(nm, dat, smry) {
dat$Determination_No <- as.character(dat$Determination_No)
rbind(dat, cbind(data.frame(Determination_No = paste(nm, "median -->")), smry))
}, names(z), z, tmp)
# $Al2O3
# Determination_No 2 3 4 5 7 8 10 12
# 1 1 2.010 2.010 2.00 2.020 1.88 2.0530 2.002830 2.090
# 2 2 2.020 2.010 2.03 2.020 1.90 2.0440 2.021725 2.050
# 3 3 2.030 2.000 1.99 2.050 1.89 2.0410 2.021725 1.960
# 4 4 2.010 2.020 2.01 2.030 1.88 2.0380 1.983936 2.090
# 5 5 2.020 2.020 2.01 2.020 1.88 2.0080 2.002830 2.060
# 6 6 2.000 2.030 2.01 2.030 1.87 2.0200 2.021725 2.020
# 7 Al2O3 median --> 2.015 2.015 2.01 2.025 1.88 2.0395 2.012278 2.055
# $As
# Determination_No 2 3 4 5 7 8 10 12
# 1 1 0.052 0.0120 0.012 0.013 0.011 0.011 0.01 NA
# 2 2 0.027 0.0120 0.012 0.013 0.011 0.010 0.01 NA
# 3 3 0.011 0.0130 0.013 0.013 0.011 0.011 0.01 NA
# 4 4 0.011 0.0120 0.012 0.013 0.012 0.011 0.01 NA
# 5 5 0.012 0.0130 0.012 0.013 0.011 0.011 0.01 NA
# 6 6 0.012 0.0130 0.012 0.013 0.011 0.011 0.01 NA
# 7 As median --> 0.012 0.0125 0.012 0.013 0.011 0.011 0.01 NA
# $Fe
# Determination_No 2 3 4 5 7 8 10 12
# 1 1 55.94 56.830 56.39 56.32 56.48 56.382 56.3 56.110
# 2 2 55.70 56.540 56.43 56.29 56.40 56.258 56.5 56.460
# 3 3 56.59 56.180 56.53 56.31 56.54 56.442 56.2 56.100
# 4 4 56.50 56.500 56.31 56.32 56.43 56.258 56.5 56.350
# 5 5 55.98 56.510 56.47 56.39 56.73 56.532 56.7 56.360
# 6 6 55.93 56.340 56.35 56.32 56.62 56.264 56.5 56.370
# 7 Fe median --> 55.96 56.505 56.41 56.32 56.51 56.323 56.5 56.355
Much more can be done with table-rendering tools such as knitr::kable, the gt package, DT (if in an HTML environment), and so on.
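Coming back to the outlier screening in the question, the per-data-frame steps could be wrapped in a function and applied over the list with lapply. This is only a sketch under assumptions: `mask_outlier_cols` and the `toy` data are hypothetical, and the score is computed in base R as the distance from the median divided by mad(), which I take to be the quantity the question gets from outliers::scores(..., "mad").

```r
# Hypothetical helper: blank out columns whose median is an outlier among the column medians
mask_outlier_cols <- function(df, threshold = 3) {
  vals <- df[-1]                                         # ignore the first column
  med  <- vapply(vals, median, numeric(1), na.rm = TRUE) # one median per column
  # distance of each column median from the overall median, in MAD units
  mscore <- abs(med - median(med, na.rm = TRUE)) / mad(med, na.rm = TRUE)
  out <- which(mscore > threshold)                       # which() drops NA scores
  if (length(out)) df[names(out)] <- NA_real_            # mask the flagged columns
  df
}

# Made-up example: column e sits far from the other columns
toy <- list(X = data.frame(Determination_No = 1:3,
                           a = c(1, 1, 1), b = c(1.1, 1.1, 1.2),
                           c = c(0.9, 0.9, 0.8), d = c(1, 1, 1),
                           e = c(5, 5, 6)))
cleaned <- lapply(toy, mask_outlier_cols)
```

If all column medians are identical, mad() is zero and the scores are not meaningful, so that case may need separate handling.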