Is there a simple way to subset unique values and their associated data?
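I have a large dataset of about 300,000 rows and 60 columns. At the moment, if I want to subset the unique values of one of my variables, I use the unique() function to create a data.frame listing all the unique values of that variable. I then match it against the main data frame to pull the associated data from my main file.
This process is a bit cumbersome, though, so I would like to know whether there is a faster way to do the same thing. For example, is there a function that selects the unique fields together with the data associated with those values?
For example: I want to make a new data frame that contains only the unique SurveyID_Block IDs and their associated island codes and abundances.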
df1 <- structure(list(SurveyID_Block = c("62003713_2", "62003087_2",
"62003713_2", "62003713_2", "62003713_1", "62003713_2", "62003713_1",
"62003713_2", "62003713_2", "62003087_1", "62003713_1", "62003713_1",
"62003713_2", "62003713_2", "62003713_1", "62003087_1", "62003087_2",
"62003713_2", "62003713_2", "62003713_2", "62003087_2", "62003713_2",
"62003713_1", "62003713_1", "62003713_1", "62003713_1", "62003713_2",
"62003713_1", "62003713_2", "62003087_1", "62003713_2", "62003087_1",
"62003713_1", "62003087_2", "62003087_2", "62003713_2", "62003713_1",
"62003087_1", "62003713_1", "62003713_1", "62003713_1", "62003087_2",
"62003087_2", "62003713_2", "62003713_2", "62003713_2", "62003713_1",
"62003087_1", "62003713_2", "62003087_2", "62003713_1", "62003713_1",
"62003713_2", "62003713_1", "62003713_2", "62003087_2", "62003087_2",
"62003087_1", "62003087_1", "62003713_1", "62003087_1", "62003087_1",
"62003087_2", "62003087_2", "62003713_2", "62003713_1", "62003713_2",
"62003713_2", "62003713_2", "62003713_1", "62003713_2", "62003087_1",
"62003713_1", "62003713_1", "62003087_1", "62003087_1", "62003713_1",
"62003087_2", "62003087_1", "62003087_2", "62003087_2", "62003087_1",
"62003087_1", "62003087_1", "62003713_2", "62003087_2", "62003713_2",
"62003087_2", "62003713_1", "62003713_1", "62003087_2", "62003087_1",
"62003087_1", "62003087_1", "62003713_2", "62003713_2", "62003087_1",
"62003713_1", "62003087_1", "62003087_2"), IslandCode = c(1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L,
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L
), totalAbun = c(667L, 174L, 667L, 667L, 715L, 667L, 715L, 667L,
667L, 1365L, 715L, 715L, 667L, 667L, 715L, 1365L, 174L, 667L,
667L, 667L, 174L, 667L, 715L, 715L, 715L, 715L, 667L, 715L, 667L,
1365L, 667L, 1365L, 715L, 174L, 174L, 667L, 715L, 1365L, 715L,
715L, 715L, 174L, 174L, 667L, 667L, 667L, 715L, 1365L, 667L,
174L, 715L, 715L, 667L, 715L, 667L, 174L, 174L, 1365L, 1365L,
715L, 1365L, 1365L, 174L, 174L, 667L, 715L, 667L, 667L, 667L,
715L, 667L, 1365L, 715L, 715L, 1365L, 1365L, 715L, 174L, 1365L,
174L, 174L, 1365L, 1365L, 1365L, 667L, 174L, 667L, 174L, 715L,
715L, 174L, 1365L, 1365L, 1365L, 667L, 667L, 1365L, 715L, 1365L,
174L)), .Names = c("SurveyID_Block", "IslandCode", "totalAbun"
), row.names = c(NA, 100L), class = "data.frame")
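For reference, the unique()-then-match workflow described above might look roughly like the sketch below. It is only an illustration under assumptions: `toy` is a hypothetical six-row extract of the sample data, and `ids`, `first_rows` and `compact` are names made up for this example. It keeps the first row found for each SurveyID_Block, which is one possible reading of "unique IDs and their associated data".

```r
# Hypothetical six-row extract of the sample data (values taken from the dput output)
toy <- data.frame(
  SurveyID_Block = c("62003713_2", "62003087_2", "62003713_2",
                     "62003713_1", "62003087_1", "62003713_1"),
  IslandCode = rep(1391L, 6),
  totalAbun = c(667L, 174L, 667L, 715L, 1365L, 715L),
  stringsAsFactors = FALSE
)

# Step 1: collect the unique IDs, in order of first appearance
ids <- unique(toy$SurveyID_Block)

# Step 2: match() returns the position of the first row for each ID,
# so indexing with it pulls one full row per unique ID
first_rows <- toy[match(ids, toy$SurveyID_Block), ]

# A more compact base-R form that selects the same rows:
# duplicated() flags every repeat of an ID after its first occurrence
compact <- toy[!duplicated(toy$SurveyID_Block), ]

print(first_rows)
```

Both forms keep only the first row per ID; if the associated columns differ between rows that share an ID, the later values are silently dropped.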
We can split the dataset by 'SurveyID_Block' to create a list of data.frames. It is better to keep the datasets in a list than to create individual data.frame objects in the global environment.
lst <- split(df1, df1$SurveyID_Block)
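As a small sketch of what split() returns, the example below runs the same call on a hypothetical six-row extract of the sample data (`toy_split` and `lst_toy` are names assumed for this illustration, not part of the original answer). Each list element is a data.frame holding all rows for one SurveyID_Block, and the element names are the ID values.

```r
# Hypothetical six-row extract of the sample data
toy_split <- data.frame(
  SurveyID_Block = c("62003713_2", "62003087_2", "62003713_2",
                     "62003713_1", "62003087_1", "62003713_1"),
  IslandCode = rep(1391L, 6),
  totalAbun = c(667L, 174L, 667L, 715L, 1365L, 715L),
  stringsAsFactors = FALSE
)

# One data.frame per distinct SurveyID_Block
lst_toy <- split(toy_split, toy_split$SurveyID_Block)

# The list is named by ID, so a single group can be pulled out by name
print(names(lst_toy))
print(lst_toy[["62003713_1"]])

# Row counts per group
print(sapply(lst_toy, nrow))
```

Keeping the pieces in a list means the same operation can be applied to every group with lapply() or sapply(), as in the row count above.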
However, if we do need to create individual datasets, this can be done with list2env:
list2env(setNames(lst, paste0('dfN', seq_along(lst))),
envir=.GlobalEnv)
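The sketch below shows the same list2env() pattern on a hypothetical four-row data frame. As an assumption made only for this illustration, it writes into a fresh environment (`demo_env`) rather than .GlobalEnv, so the example leaves no stray objects in the workspace; `toy_env` and `lst_env` are likewise made-up names.

```r
# Hypothetical four-row extract of the sample data
toy_env <- data.frame(
  SurveyID_Block = c("62003713_2", "62003087_2", "62003713_2", "62003713_1"),
  IslandCode = rep(1391L, 4),
  totalAbun = c(667L, 174L, 667L, 715L),
  stringsAsFactors = FALSE
)

lst_env <- split(toy_env, toy_env$SurveyID_Block)

# Rename the list elements dfN1, dfN2, ... and copy each one into an
# environment as a separate object (here a scratch environment, not .GlobalEnv)
demo_env <- new.env()
list2env(setNames(lst_env, paste0("dfN", seq_along(lst_env))), envir = demo_env)

# The pieces now exist as individual data.frame objects
print(ls(demo_env))
print(demo_env$dfN1)
```

Note that the dfN1, dfN2, ... names follow the order of the split list, so the link back to the original SurveyID_Block values is only kept inside each data frame's own column.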