R: nothing changes after median imputation
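Does anyone know what is going on here? I am trying to impute NA values, but I am getting nowhere. Here is my data frame. I am including the whole thing rather than just the first n rows because I thought having the full structure might help: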
structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L,
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L,
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L,
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L,
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842, 1075, 917,
922, 920, 973), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L, 107L
), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L,
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L,
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L,
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L,
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")
I check whether there are any NA values:
any(is.na(moneyball_training_data)) # TRUE
I find which columns contain those NA values:
moneyball_training_data %>% summarise(across(, ~ any(is.na(.x))))
I check the class of one of the variables that has NA values:
class(moneyball_training_data$TEAM_BATTING_SO) # numeric
I try to impute it with the median of that vector:
moneyball_training_data$TEAM_BATTING_SO[moneyball_training_data$TEAM_BATTING_SO == NA] <- median(moneyball_training_data$TEAM_BATTING_SO)
any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE
But when I ask whether there are any NA values, I still get TRUE.
Maybe I forgot to remove the NAs in the call to median(), so I try again with na.rm = TRUE:
moneyball_training_data$TEAM_BATTING_SO[moneyball_training_data$TEAM_BATTING_SO == NA] <- median(moneyball_training_data$TEAM_BATTING_SO, na.rm = TRUE)
any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE
But that does not work either. So I compute the median separately and then use that value for the imputation:
median(moneyball_training_data$TEAM_BATTING_SO, na.rm = TRUE) # 750
moneyball_training_data$TEAM_BATTING_SO[moneyball_training_data$TEAM_BATTING_SO == NA] <- 750
any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE
But that does not replace the NA values with 750. Maybe I should compare against "" instead of NA:
moneyball_training_data$TEAM_BATTING_SO[moneyball_training_data$TEAM_BATTING_SO == ""] <- 750
any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE
That does not work either. Does anyone know why this imputation is not working?
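When you build the logical vector used for subsetting, use is.na(), which you already use correctly before and after that step. A comparison such as x == NA evaluates to NA for every element instead of TRUE or FALSE, so the index selects nothing and the assignment leaves the column unchanged.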
moneyball_training_data$TEAM_BATTING_SO[is.na(moneyball_training_data$TEAM_BATTING_SO)] <- median(moneyball_training_data$TEAM_BATTING_SO, na.rm = TRUE)
any(is.na(moneyball_training_data$TEAM_BATTING_SO))
# [1] FALSE
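To make the difference concrete, here is a minimal sketch on a toy vector. The vector x and its values are made up for illustration and are not taken from the moneyball data; the point is only to show what each kind of index does.

```r
# Hypothetical toy vector standing in for a numeric column with gaps
x <- c(842, NA, 917, 922, NA, 973)

# Comparing with == propagates NA: the result is NA for every element
x == NA
#> [1] NA NA NA NA NA NA

# An all-NA logical index selects nothing to assign to, so y stays unchanged
y <- x
y[y == NA] <- median(y, na.rm = TRUE)
anyNA(y)
#> [1] TRUE

# is.na() returns a real TRUE/FALSE vector, so the assignment reaches the gaps
z <- x
z[is.na(z)] <- median(z, na.rm = TRUE)
z
#> [1] 842.0 919.5 917.0 922.0 919.5 973.0
```

The comparison with "" fails for a similar reason: NA == "" is also NA, and the non-missing numbers are not equal to "", so nothing is selected there either.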