Looping through aggregated data in R
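I am trying to impute the missing values in a particular column of a data frame.
My intention is to replace them with the group mean, grouping by another column.
I saved the aggregated result using aggregate: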
# Replace LotFrontage missing values by Neighborhood mean
lot_frontage_by_neighborhood = aggregate(LotFrontage ~ Neighborhood, combined, mean)
Now I want to implement something like this:
for key, group in lot_frontage_by_neighborhood:
    idx = (combined["Neighborhood"] == key) & (combined["LotFrontage"].isnull())
    combined[idx, "LotFrontage"] = group.median()
This is Python code, of course.
I'm not sure how to do this in R; can anyone help?
For example:
Neighborhood LotFrontage
A 20
A 30
B 20
B 50
A <NA>
The NA record should be replaced with 25 (the mean LotFrontage of all records in Neighborhood A).
Thanks
Is this the kind of thing you are looking for? You may need the which() function to determine which rows have NA values.
set.seed(1)
Neighborhood = sample(letters[1:4], 10, TRUE)
LotFrontage = rnorm(10,0,1)
LotFrontage[sample(10, 2)] = NA
# This data frame has 2 columns. The LotFrontage column has 2 missing values.
df = data.frame(Neighborhood = Neighborhood, LotFrontage = LotFrontage)
# Sets the missing values in the LotFrontage column to the mean of the LotFrontage values from the rows with that Neighborhood
na_rows <- which(is.na(df$LotFrontage))
# as.character() keeps the comparison below working if Neighborhood is a factor
x <- as.character(df$Neighborhood[na_rows])
f <- function(x) mean(df[(df$Neighborhood==x),]$LotFrontage, na.rm =TRUE)
# vapply() returns a numeric vector; lapply() would return a list and turn the column into a list
df$LotFrontage[na_rows] <- vapply(x, f, numeric(1))
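If you would rather reuse the table you already built with aggregate(), or avoid an explicit per-row function, here is a minimal sketch of two other base R options, run on the five-row example from the question. The names idx, pos, imputed_match and imputed_ave are just illustrative; only combined and lot_frontage_by_neighborhood come from the question.

```r
# Example data from the question.
combined <- data.frame(
  Neighborhood = c("A", "A", "B", "B", "A"),
  LotFrontage  = c(20, 30, 20, 50, NA),
  stringsAsFactors = FALSE
)

# Option 1: look up each missing row's group mean in the aggregate() result.
# The formula interface drops rows containing NA by default (na.action = na.omit),
# so the means are computed from the observed values only.
lot_frontage_by_neighborhood <- aggregate(LotFrontage ~ Neighborhood, combined, mean)
idx <- is.na(combined$LotFrontage)
pos <- match(combined$Neighborhood[idx], lot_frontage_by_neighborhood$Neighborhood)
imputed_match <- combined
imputed_match$LotFrontage[idx] <- lot_frontage_by_neighborhood$LotFrontage[pos]

# Option 2: ave() applies a function within each group and returns a vector
# aligned with the original rows, so no separate lookup table is needed.
imputed_ave <- combined
imputed_ave$LotFrontage <- ave(
  combined$LotFrontage, combined$Neighborhood,
  FUN = function(v) ifelse(is.na(v), mean(v, na.rm = TRUE), v)
)

print(imputed_match)
print(imputed_ave)
```

Both options fill the NA in Neighborhood A with 25 on this example. One caveat: a neighborhood whose LotFrontage values are all missing has no row in the aggregate() table, so match() returns NA and that value stays missing; with ave() the mean of an empty group is NaN.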