For loop warning: "number of items to replace is not a multiple of replacement length" with two dataframes
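I am trying to create a new vector by applying a transformation to a variable in one of my data frames, using data from another data frame.
I have two data frames, df1 and df2. They have different dimensions: df1 has more than 20,000 rows and df2 has 76 rows.
df1 is my original dataset. I created df2 for Ag_ppm as follows: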
df2 <- df1 %>%
  filter(!is.na(Ag_ppm)) %>%
  group_by(Year, Zone, SubZone) %>%
  summarise(
    n = sum(!is.na(Ag_ppm)),
    min = min(Ag_ppm),
    max = max(Ag_ppm),
    mean = mean(Ag_ppm),
    sd = sd(Ag_ppm),
    iqr = IQR(Ag_ppm),
    Q1 = quantile(Ag_ppm, 0.25),
    median = median(Ag_ppm),
    Q3 = quantile(Ag_ppm, 0.75),
    LW = min(Ag_ppm > (quantile(Ag_ppm, .25) - 1.5 * IQR(Ag_ppm))),
    UF = quantile(Ag_ppm, .75) + 1.5 * IQR(Ag_ppm))
The first rows of each data frame look like this:
head(df1, n=5)
# A tibble: 5 x 12
  Year  Zone            SubZone         Au_ppm Ag_ppm Cu_ppm Pb_ppm Zn_ppm As_ppm Sb_ppm Bi_ppm Mo_ppm
  <chr> <chr>           <chr>            <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 1990  BugLake         BugLake          0.007    3.7     17     27     23      1      1     NA      1
2 1983  Johnny Mountain Johnny Mountain   0.01    1.6     71     63    550      4     NA     NA     NA
3 1983  Khyber Pass     Khyber Pass       0.12   11.5    275    204   8230    178      7     60     NA
4 1987  Chebry          Ridge Line Grid   0.05    2.2     35     21    105     16      6     NA     NA
5 1987  Chebry          Handel Grid      0.004    1.3     29     27    663     45      2     NA     NA
head(df2, n=5)
# A tibble: 5 x 14
# Groups:   Year, Zone [3]
  Year  Zone            SubZone         n   min   max  mean    sd   iqr    Q1 median    Q3    LW    UF
  <chr> <chr>           <chr>       <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <int> <dbl>
1 1981  Chebry          Handel         52   0.6   5.1  1.83 0.947 0.925   1.2    1.6  2.12     1  3.51
2 1981  Imperial Metals Handel         24   0.9   6.9  2.81  1.43  1.35  1.95   2.65   3.3     1  5.33
3 1983  Chebry          Chebry          5   0.7   3.7  1.78  1.19   0.9   1.2    1.2   2.1     1  3.45
4 1983  Chebry          Handel         17   0.1   0.7 0.318 0.163   0.2   0.2    0.3   0.4     1   0.7
5 1983  Chebry          Handel Grid   225   0.1    16 0.892  1.33   0.7   0.3    0.6     1     1  2.05
I want to apply the following equation to the Ag_ppm column in df1, using the median and IQR calculated for each subgroup in df2:
Z = (X - median) / IQR
To do this, I wrote:
# Initialize Ag_std vector with NA values
Ag_std <- rep(NA, times = nrow(df1))
# Populate Ag_std vector with standardized Ag values
Ag_std <-
  for (i in 1:nrow(df1)) {
    if (!is.na(df1$Ag_ppm[i])) {
      filter(df2, Zone == df1$Zone[i], Year == df1$Year[i],
             SubZone == df1$SubZone[i])
      Ag_std[i] <- (df1$Ag_ppm[i] - df2$median)/df2$iqr
    }
  }
But the loop does not work (it returns a NULL vector), and I get this warning:
1: In Ag_std[i] <- (df1$Ag_ppm[i] - df2$median)/df2$iqr :
number of items to replace is not a multiple of replacement length
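I have looked at similar questions but have not found an answer that fits my case. Any help would be greatly appreciated!
If there is a better way to achieve the same result without a loop (I believe there is, e.g. apply()), I would appreciate comments on that as well. Unfortunately, I am not familiar enough with the alternatives to implement them quickly.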
Since you have df2 as a separate data frame, you can join and mutate:
df1 %>%
  left_join(df2, by = c("Year", "Zone", "SubZone")) %>%
  mutate(Z = (Ag_ppm - median) / iqr)
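To see what the join does row by row, here is a minimal sketch on made-up data. The names toy_df1, toy_df2 and toy_joined are hypothetical stand-ins, and the Ag_ppm values are invented; only the two median/iqr pairs are borrowed from the head(df2) printout in the question.

```r
library(dplyr)

# Hypothetical miniature of df1: three observations, one with a missing Ag_ppm
toy_df1 <- tibble(
  Year    = c("1981", "1981", "1983"),
  Zone    = c("Chebry", "Chebry", "Chebry"),
  SubZone = c("Handel", "Handel", "Chebry"),
  Ag_ppm  = c(2.1, NA, 3.0)
)

# Hypothetical miniature of df2: one summary row per (Year, Zone, SubZone) group
toy_df2 <- tibble(
  Year    = c("1981", "1983"),
  Zone    = c("Chebry", "Chebry"),
  SubZone = c("Handel", "Chebry"),
  median  = c(1.6, 1.2),
  iqr     = c(0.925, 0.9)
)

# left_join() copies the matching group's median and iqr onto every row of
# toy_df1, so the standardization becomes a plain vectorized expression
toy_joined <- toy_df1 %>%
  left_join(toy_df2, by = c("Year", "Zone", "SubZone")) %>%
  mutate(Z = (Ag_ppm - median) / iqr)
```

The result keeps one row per original observation, and rows where Ag_ppm is NA simply get NA for Z, so no explicit is.na() check is needed.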
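In fact, you do not need df2 as a separate object: the same per-group statistics can be generated within df1 itself (using a grouped mutate rather than summarise keeps one row per observation, so no join is required).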
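One way to compute the per-group statistics inside df1 itself is a grouped mutate. The following is only a sketch under assumptions: toy is a made-up stand-in for df1, and the column names grp_median and grp_iqr are hypothetical. na.rm = TRUE plays the role of the filter(!is.na(Ag_ppm)) step used when building df2.

```r
library(dplyr)

# Hypothetical stand-in for df1 (values invented for illustration)
toy <- tibble(
  Year    = c("1981", "1981", "1981", "1981", "1983", "1983", "1983", "1983"),
  Zone    = "Chebry",
  SubZone = "Handel",
  Ag_ppm  = c(1, 2, 3, NA, 10, 20, 30, 40)
)

toy_std <- toy %>%
  group_by(Year, Zone, SubZone) %>%
  mutate(
    grp_median = median(Ag_ppm, na.rm = TRUE),  # per-group median, repeated on each row
    grp_iqr    = IQR(Ag_ppm, na.rm = TRUE),     # per-group IQR, repeated on each row
    Ag_std     = (Ag_ppm - grp_median) / grp_iqr
  ) %>%
  ungroup()
```

Unlike summarise(), mutate() does not collapse each group to a single row, which is why the standardized values line up with the original observations.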
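This can be done relatively easily with data.table:
library(data.table)
DT <- data.table(df1)
# Function to apply within each group. na.rm = TRUE is needed because
# Ag_ppm contains NA values, which quantile() rejects by default.
fun <- function(x) {
  (x - median(x, na.rm = TRUE)) /
    diff(quantile(x, c(.25, .75), na.rm = TRUE, names = FALSE))
}
# Create a new column with the desired result, computed by group
DT[, Ag_std := fun(Ag_ppm), by = list(Year, Zone, SubZone)]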
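Also, I think the loop can be fixed by assigning the result of filter to a temporary object and taking the median and iqr from that object. In the original loop the filtered row was discarded, so df2$median and df2$iqr still held all 76 values, and 76 values cannot be stored in the single element Ag_std[i]. The loop itself should also not be assigned to Ag_std, because a for loop returns NULL and that overwrites the vector:
for (i in seq_len(nrow(df1))) {
  if (!is.na(df1$Ag_ppm[i])) {
    # Keep only the df2 row that matches this observation's group
    temp.var <- filter(df2, Zone == df1$Zone[i], Year == df1$Year[i],
                       SubZone == df1$SubZone[i])
    Ag_std[i] <- (df1$Ag_ppm[i] - temp.var$median) / temp.var$iqr
  }
}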
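The two symptoms reported in the question (the warning and the NULL result) can be reproduced in isolation with base R. This is a small illustration with made-up values; out, res and loop_value are hypothetical names.

```r
# (1) Storing several values in a single element triggers the warning.
# tryCatch() is used here only to capture the warning text.
out <- rep(NA_real_, 3)
res <- tryCatch(
  {
    out[1] <- c(10, 20, 30)  # three values into one slot
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)

# (2) A for loop evaluates to NULL, so `x <- for (...) {...}` replaces x with NULL.
loop_value <- for (i in 1:3) i
is.null(loop_value)
```

Here res holds the same "number of items to replace is not a multiple of replacement length" message seen in the question, and loop_value is NULL regardless of what the loop body does.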