Filtering for only complete sets of years
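I have yield data organized by state and county. From these data, I want to keep only the counties that have a complete set of years from 1970 to 2000.
The code below removes some of the incomplete cases, but it fails to drop all of them, especially on larger datasets.
Some fake data: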
K <- 5 # number of rows set to NaN
df <- data.frame(state = c(rep(1, 10), rep(2, 10)),
county = rep(1:4, 5), yield = 100)
df[sample(1:20, K), 3] <- NaN
Current code:
df1 <- read.csv("gly2.csv", header = TRUE)
df <- data.frame(df1)
droprows_1 <- function(df, v1, v2, v3, value = 'x'){
  idx <- df[, v3] == value
  todrop <- df[idx, c(v1, v2)]; todrop # should have K rows missing
  todrop <- unique(todrop); todrop     # but unique values could be less
  nrow <- dim(todrop)[1]
  for(i in 1:nrow){
    idx <- apply(df, 1, function(x) all(x == todrop[i, ]))
    df <- df[!idx, ]
  }
  return(df)
}
qq <- droprows_1(df, 1, 2, 3)
Thanks
To drop every county that has at least one missing value, use:
library(dplyr)
# Group by state as well, because county IDs repeat across states in the fake data
df %>% group_by(state, county) %>% filter(!any(is.nan(yield)))
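The filter above only looks at missing yields. If the real data also has a year column (an assumption on my part, since the fake data has none), a rough sketch of the "complete 1970 to 2000" check could look like the following. The toy data frame `toy` and the name `wanted_years` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data; the `year` column is assumed, not taken from the question
toy <- bind_rows(
  data.frame(state = 1, county = 1, year = 1970:2000, yield = 100),  # complete
  data.frame(state = 1, county = 2, year = 1970:1999, yield = 100),  # year 2000 absent
  data.frame(state = 2, county = 1, year = 1970:2000, yield = 100)   # all years, one NaN yield
)
toy$yield[toy$state == 2 & toy$year == 1985] <- NaN

wanted_years <- 1970:2000

# Keep a state/county group only if every wanted year has a non-missing yield
complete <- toy %>%
  group_by(state, county) %>%
  filter(all(wanted_years %in% year[!is.na(yield)])) %>%
  ungroup()
```

Here only state 1, county 1 survives. Note that `is.na()` is TRUE for both `NA` and `NaN`, so it covers either way of coding a missing yield.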
This is easy in data.table. I didn't follow your example exactly, but this sample data gets at what I think you're looking for:
library(data.table)

N = 20000L
DT = data.table(
  state = sample(letters, size = N, replace = TRUE),
  county = sample(20L, size = N, replace = TRUE),
  year = rep(1981:2000, length.out = N),
  var = rnorm(N),
  key = c("state", "county", "year")
)
# Random sampling duplicates a bunch of state/county/year combinations;
# keep one row per combination
DT = unique(DT, by = c("state", "county", "year"))
Now, to answer your question. In case you're new to data.table, I'll go through it step by step; the last line is all you really need.
# This will count the number of years for each state/county combination:
DT[ , .N, by = .(state, county)]
# To focus on only those combinations which appear for every year
# (in my example there are 20 years)
# (also simultaneously drop the N column since we know every N is 20)
DT[ , .N, by = .(state, county)][N==20L, !"N"]
# The grand finale: reduce your data set to
# ONLY those combinations with full support:
full_data = DT[.(DT[ , .N, by = .(state, county)][N==20L, !"N"])]
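Note that the last step requires DT to be keyed by state and then county, in that order, which can be done with setkey(DT, state, county). If you're not familiar with data.table notation, I recommend this page, and in particular this vignette.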
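If you'd rather not depend on the key, the same join can be sketched with the `on` argument, which names the join columns directly. The small table below is a hypothetical stand-in for the sample data, not the table built above.

```r
library(data.table)

# Hypothetical toy table: "a"/1 and "b"/2 cover all 20 years, "a"/2 lacks 2000
DT <- rbind(
  data.table(state = "a", county = 1L, year = 1981:2000),
  data.table(state = "a", county = 2L, year = 1981:1999),
  data.table(state = "b", county = 2L, year = 1981:2000)
)
DT[, var := rnorm(.N)]

# State/county pairs that appear in all 20 years
complete_pairs <- DT[, .N, by = .(state, county)][N == 20L, .(state, county)]

# Ad hoc join: `on` names the join columns, so no setkey() call is needed
full_data <- DT[complete_pairs, on = .(state, county)]
```

This keeps the rows for "a"/1 and "b"/2 and drops "a"/2, whether or not DT has a key.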
Edit: I just saw that you may be storing NA values for year, in which case you should adjust the code so that the NAs are not counted:
full_data = DT[.(DT[!is.na(year), .N, by = .(state, county)][N==20L, !"N"])]
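Counting rows works when every group covers the same window and has no duplicate years. For the range in the question (1970 to 2000, which is 31 years rather than the 20 in my sample), a more direct check is to test membership against the wanted years. This is only a sketch: the toy table and the names `wanted_years`, `has_all` and `keep` are made up for illustration.

```r
library(data.table)

wanted_years <- 1970:2000   # the range from the question, 31 years

# Hypothetical toy table
DT <- rbind(
  data.table(state = "a", county = 1L, year = 1970:2000),         # complete
  data.table(state = "a", county = 2L, year = c(1970:1999, NA)),  # 31 rows, but 2000 is NA
  data.table(state = "b", county = 1L, year = 1965:1995)          # 31 rows, wrong window
)

# Flag each state/county pair that contains every wanted year
keep <- DT[, .(has_all = all(wanted_years %in% year)),
           by = .(state, county)][has_all == TRUE, .(state, county)]

full_data <- DT[keep, on = .(state, county)]
```

All three groups have 31 rows, so a plain count would keep them all; the membership test keeps only "a"/1.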