Filtering for only complete sets of years

I have yield data organized by state and county. From this data, I want to keep only those counties that have a complete set of years from 1970 to 2000.

The code below removes some of the incomplete cases, but it fails to drop all of them, especially on larger datasets.

Some fake data:

K <- 5 # number of rows set to NaN

df <- data.frame(state = c(rep(1, 10), rep(2, 10)),
                 county = rep(1:4, 5), yield = 100)

df[sample(1:20, K), 3] <- NaN

Current code:

df1 <- read.csv("gly2.csv",header=TRUE)

df <- data.frame(df1)


droprows_1 <- function(df, v1, v2, v3, value = 'x'){
  idx <- df[, v3] == value
  todrop <- df[idx, c(v1, v2)]; todrop # should have K rows missing
  todrop <- unique(todrop); todrop # but unique values could be less

  nrow <- dim(todrop)[1]
  for(i in 1:nrow){
    idx <- apply(df, 1, function(x) all(x == todrop[i, ]))
    df <- df[!idx, ]
  }
  return(df)
}

qq <- droprows_1(df, 1, 2, 3)

Thanks

To drop every county that has at least one missing value, use:

library(dplyr)
# Group by state as well, since county codes repeat across states
df %>% group_by(state, county) %>% filter(!any(is.nan(yield)))
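The filter above only catches rows whose yield is NaN; a county whose rows for some years are absent altogether would still pass. A minimal sketch of a check on the full set of years, assuming the real data has a `year` column (the fake data in the question does not) and using a made-up `toy` data frame for illustration:

```r
library(dplyr)

years <- 1970:2000  # the range that must be fully covered

# Hypothetical toy data (not from the question):
#   state 1 / county 1: every year present, no missing yields
#   state 1 / county 2: the row for 1985 is absent
#   state 2 / county 1: every year present, but the 1990 yield is NaN
toy <- bind_rows(
  data.frame(state = 1, county = 1, year = years, yield = 100),
  data.frame(state = 1, county = 2, year = setdiff(years, 1985), yield = 100),
  data.frame(state = 2, county = 1, year = years, yield = 100)
)
toy$yield[toy$state == 2 & toy$year == 1990] <- NaN

# Keep a state/county group only if every required year has a non-missing yield.
# is.na() is TRUE for both NA and NaN.
complete <- toy %>%
  group_by(state, county) %>%
  filter(all(years %in% year[!is.na(yield)])) %>%
  ungroup()
```

Testing membership with `%in%` rather than counting rows means duplicated years cannot make an incomplete county look complete.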

This is easy in data.table. I didn't follow your example exactly, but this sample data gets at what I think you're looking for:

library(data.table)

N = 20000L
DT = data.table(
  state = sample(letters, size = N, replace = TRUE),
  county = sample(20L, size=N, replace = TRUE),
  year = rep(1981:2000, length.out = N),
  var = rnorm(N),
  key = c("state", "county", "year")
)

# Duplicated a bunch of state/year combinations
DT = unique(DT, by = c("state", "county", "year"))

Now, to answer your question. In case you're new to data.table, I'll go step by step; the last line is all you really need.

# This will count the number of years for each state/county combination:
DT[ , .N, by = .(state, county)]

# To focus on only those combinations which appear for every year
# (in my example there are 20 years)
# (also simultaneously drop the N column since we know every N is 20)
DT[ , .N, by = .(state, county)][N==20L, !"N"]

# The grande finale: reduce your data set to
# ONLY those combinations with full support:
full_data = DT[.(DT[ , .N, by = .(state, county)][N==20L, !"N"])]

Note that the last step requires DT to be keyed by state and then county, which can be done with setkey(DT, state, county). If you're unfamiliar with data.table notation, I recommend this page and in particular this vignette.
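If setting a key is not convenient, the same subset can probably be obtained with an ad hoc join through the `on` argument. This is only a sketch on a small made-up table (`toy`, `keep` and `n_years` are illustrative names, not part of the answer above):

```r
library(data.table)

set.seed(1)
# Hypothetical toy data: two state/county combinations, 20 years each
toy <- data.table(
  state  = rep(c("a", "b"), each = 20),
  county = 1L,
  year   = rep(1981:2000, 2),
  var    = rnorm(40)
)
toy <- toy[!(state == "b" & year == 1990)]  # state "b" now lacks one year

n_years <- 20L  # number of years a complete combination must have

# Combinations that appear exactly n_years times
keep <- toy[, .N, by = .(state, county)][N == n_years, .(state, county)]

# Join back on the grouping columns; no key is needed
full_toy <- toy[keep, on = .(state, county)]
```

Like the keyed version, this counts rows, so it assumes each state/county/year combination appears at most once.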


Edit: I just saw that you may be storing NA values for year, in which case you should adjust the code so that the NAs are not counted:

full_data = DT[.(DT[!is.na(year), .N, by = .(state, county)][N==20L, !"N"])]
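A variant that tests for the required years directly, instead of counting rows, may be more robust when the missing values sit in the measurement column or when years are duplicated. A rough sketch, again on made-up data (`toy` and `yrs` are hypothetical names):

```r
library(data.table)

yrs <- 1981:2000  # years every combination must cover

# Hypothetical toy data: three states, one county each
toy <- data.table(
  state  = rep(c("a", "b", "c"), each = 20),
  county = 1L,
  year   = rep(yrs, 3),
  var    = 1
)
toy[state == "b" & year == 1990, var := NaN]  # missing measurement
toy[state == "c" & year == 1995, year := NA]  # missing year

# Return a group's rows (.SD) only when every required year
# has a non-missing measurement; other groups are dropped.
full_toy <- toy[, if (all(yrs %in% year[!is.na(var)])) .SD,
                by = .(state, county)]
```

Here only state "a" survives, because "b" has a NaN measurement and "c" has a missing year.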