Subset data by randomly selecting rows based on two columns
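I have a large data.frame, and I would like to create a new data.frame from rows selected at random within groups defined by two columns.
There are 90 unique elkIDs, each with about 48 rows per FixDate. I want to build a new data.frame that contains all 90 unique elkIDs, with 4 randomly selected rows per FixDate.
The data look like this: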
> head(df)
elkID X Y Fix.Date.Time FixDate
1 245 550345.1 4826676 2010-02-24 10:00:58 2010-02-24
2 245 550217.9 4826519 2010-02-24 10:30:47 2010-02-24
3 245 550066.3 4826478 2010-02-24 11:00:41 2010-02-24
4 245 549912.6 4826419 2010-02-24 11:30:48 2010-02-24
5 245 549977.3 4826438 2010-02-24 12:00:55 2010-02-24
6 245 549795.1 4826294 2010-02-24 12:30:29 2010-02-24
I would like it to look like this (4 rows per FixDate for each unique elkID):
> df2
elkID X Y Fix.Date.Time FixDate
1 245 550345.1 4826676 2010-02-24 10:00:58 2010-02-24
2 245 550217.9 4826519 2010-02-24 10:30:47 2010-02-24
3 245 550066.3 4826478 2010-02-24 11:00:41 2010-02-24
4 245 549912.6 4826419 2010-02-24 11:30:48 2010-02-24
5 245 549977.3 4826438 2010-02-24 12:00:55 2010-02-25
6 245 549795.1 4826294 2010-02-24 12:30:29 2010-02-25
I am using RStudio 0.99.467 and R 3.2.1.
If you want to loop over them, you could try the following:
# initialize an empty object to collect the sampled rows
newdf = NULL
# extract unique elk IDs
IDs = unique(df$elkID)
# outer loop (i) subsets each ID; inner loop (j) walks through
# the unique dates recorded for that ID
for(i in seq_along(IDs)){
  data1 = df[df$elkID == IDs[i],]
  dates = unique(data1$FixDate)
  for(j in seq_along(dates)){
    data2 = data1[data1$FixDate == dates[j],]
    # select up to 4 rows at random for this particular ID and date
    # (min() avoids an error when a group has fewer than 4 rows)
    data2 = data2[sample(nrow(data2), min(4, nrow(data2))),]
    newdf = rbind(newdf,data2)
  }
}
head(newdf)
tail(newdf)
Does this do what you want?
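The same per-ID, per-date sampling can also be written in base R without explicit nested loops. The following is only a sketch on made-up data: the toy data frame `toy` and the helper name `sample_rows` are hypothetical and do not come from the question, and `min()` is an added guard so that groups with fewer than 4 rows do not raise an error.

```r
# Toy data standing in for the real df (values are made up for illustration)
set.seed(1)
toy <- data.frame(
  elkID   = rep(c(245, 246), each = 20),
  FixDate = rep(rep(as.Date(c("2010-02-24", "2010-02-25")), each = 10), times = 2),
  X       = runif(40),
  Y       = runif(40)
)

# Hypothetical helper: draw up to n rows at random from one group
sample_rows <- function(d, n = 4) {
  d[sample(nrow(d), min(n, nrow(d))), , drop = FALSE]
}

# Split by the two grouping columns, sample within each piece, and recombine
groups <- split(toy, list(toy$elkID, toy$FixDate), drop = TRUE)
toy_sub <- do.call(rbind, lapply(groups, sample_rows))
rownames(toy_sub) <- NULL

# Count the rows kept for each elkID/FixDate combination
table(toy_sub$elkID, toy_sub$FixDate)
```

Binding the pieces once with `do.call(rbind, ...)` avoids growing a data frame inside a loop, which may matter when there are many ID/date combinations.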
For large data frames, I recommend the data.table package:
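library(data.table)
setDT(df)  # convert df to a data.table by reference
# sample up to 4 rows per elkID/FixDate group
# (min() avoids an error when a group has fewer than 4 rows)
df[, .SD[sample(.N, min(4L, .N))], by = .(elkID, FixDate)]
# or, to also sort the result by the grouping columns:
df[, .SD[sample(.N, min(4L, .N))], keyby = .(elkID, FixDate)]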
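To check that a grouped sample behaves as expected, it may help to run it on a small table first. The sketch below uses made-up data (the names `toy`, `idx` and `toy_sub` are hypothetical) and a variant of the same idea that collects row numbers with `.I` and subsets once, instead of subsetting `.SD` per group. Setting a seed is an assumption about what you want: it only makes the random selection repeatable.

```r
library(data.table)

# Toy data standing in for the real df (values are made up for illustration)
set.seed(42)  # fix the seed so the random selection is repeatable
toy <- data.table(
  elkID   = rep(c(245, 246), each = 20),
  FixDate = rep(rep(as.Date(c("2010-02-24", "2010-02-25")), each = 10), times = 2),
  X       = runif(40)
)

# Collect the row numbers (.I) of up to 4 sampled rows per group,
# then subset the table once with those row numbers
idx <- toy[, .I[sample(.N, min(4L, .N))], by = .(elkID, FixDate)]$V1
toy_sub <- toy[idx]

# Count the rows kept for each elkID/FixDate group
toy_sub[, .N, by = .(elkID, FixDate)]
```

On the real data the same count, `df2[, .N, by = .(elkID, FixDate)]`, gives a quick way to confirm that every group contributed the intended number of rows.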