Looping through aggregated data in R

Question

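I am trying to impute missing values in a particular column of a data frame.

My intention is to replace each missing value with the mean of its group, where the groups are defined by another column.

I used aggregate and saved the aggregated result: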

# Replace LotFrontage missing values by Neighborhood mean
lot_frontage_by_neighborhood = aggregate(LotFrontage ~ Neighborhood, combined, mean)

Now I would like to implement something like this:

for key, group in lot_frontage_by_neighborhood:
    idx = (combined["Neighborhood"] == key) & (combined["LotFrontage"].isnull())
    combined[idx, "LotFrontage"] = group.median()

This is, of course, Python code.

I am not sure how to do this in R. Can anyone help?

For example:

Neighborhood  LotFrontage
     A            20
     A            30
     B            20
     B            50
     A           <NA>

The NA record should be replaced with 25 (the mean LotFrontage of all records in Neighborhood A).

Thanks

Answer 1

Is this the idea you are looking for? You may need the which() function to determine which rows have NA values.

set.seed(1)
Neighborhood = sample(letters[1:4], 10, TRUE)
LotFrontage = rnorm(10,0,1)
LotFrontage[sample(10, 2)] = NA

# This data frame has 2 columns. The LotFrontage column has 2 missing values.
df = data.frame(Neighborhood = Neighborhood, LotFrontage = LotFrontage)

# Set each missing value in the LotFrontage column to the mean of the LotFrontage values from the rows with the same Neighborhood
na_rows <- which(is.na(df$LotFrontage))
x <- df$Neighborhood[na_rows]
f <- function(n) mean(df$LotFrontage[df$Neighborhood == n], na.rm = TRUE)
# vapply returns a numeric vector; lapply would return a list
df$LotFrontage[na_rows] <- vapply(x, f, numeric(1))
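For comparison, here is a minimal sketch that stays closer to the loop in the question: it walks the rows of the aggregate() result and fills the matching NA rows. It rebuilds the small example from the question as `combined`, and it uses the mean (as the question's text asks) rather than the median from the Python snippet. The names `original`, `group_mean`, and `imputed` are only illustrative. The last step shows a loop-free alternative with base R's ave(), which may be simpler if the aggregated table is not needed for anything else.

```r
# Example data from the question: one NA in Neighborhood A
combined <- data.frame(
  Neighborhood = c("A", "A", "B", "B", "A"),
  LotFrontage  = c(20, 30, 20, 50, NA),
  stringsAsFactors = FALSE
)
original <- combined  # untouched copy, used for the ave() variant below

# The formula interface of aggregate() drops NA rows by default
lot_frontage_by_neighborhood <- aggregate(LotFrontage ~ Neighborhood, combined, mean)

# Loop over the aggregated rows, one neighborhood per iteration
for (i in seq_len(nrow(lot_frontage_by_neighborhood))) {
  key        <- lot_frontage_by_neighborhood$Neighborhood[i]
  group_mean <- lot_frontage_by_neighborhood$LotFrontage[i]
  idx <- combined$Neighborhood == key & is.na(combined$LotFrontage)
  combined$LotFrontage[idx] <- group_mean
}

# Loop-free alternative: ave() applies FUN within each neighborhood
imputed <- ave(original$LotFrontage, original$Neighborhood,
               FUN = function(v) ifelse(is.na(v), mean(v, na.rm = TRUE), v))
```

With the example data, both `combined$LotFrontage` and `imputed` should end up with the NA in Neighborhood A filled by that group's mean. A neighborhood whose values are all NA never appears in the aggregated table, so the loop leaves it as NA, while the ave() variant gives NaN for it.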


Tags: aggregate, r, dataframe, na