Create data.frame based on unique column values in R?
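I have a data.frame of observations that includes metadata columns, and I want to create a new data.frame with the same columns in which each row is one unique combination of the values found in each column. Here is an example: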
# what I have
df <- data.frame("Color" = c("Red", "Blue", "Green", "Green"),
"Size" = c("Large", "Large", "Large", "Small"),
"Value" = c(0, 1, 1, 1))
> df
  Color  Size Value
1   Red Large     0
2  Blue Large     1
3 Green Large     1
4 Green Small     1
# what I want
ideal_df <- data.frame("Color" = c("Red", "Red", "Red", "Red", "Blue", "Blue", "Blue", "Blue", "Green", "Green", "Green", "Green"),
"Size" = c("Large", "Large", "Small", "Small", "Large", "Large", "Small", "Small", "Large", "Large", "Small", "Small"),
"Value" = c(0,1,0,1,0,1,0,1,0,1,0,1))
> ideal_df
   Color  Size Value
1    Red Large     0
2    Red Large     1
3    Red Small     0
4    Red Small     1
5   Blue Large     0
6   Blue Large     1
7   Blue Small     0
8   Blue Small     1
9  Green Large     0
10 Green Large     1
11 Green Small     0
12 Green Small     1
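I tried using a for loop, but my data is much larger than this example and the loop hangs. I searched for this problem but could not find anything similar. If this has already been answered, I'm happy to be pointed to the other thread. Thank you for your time.
This is a job for expand() from the tidyr package: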
library(tidyr)
new_df <- df %>% expand(Color, Size, Value)
Just to add a base R solution:
new_df <- expand.grid(Color = unique(df$Color)
, Size = unique(df$Size)
, Value = unique(df$Value))
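The expand.grid() call above names every column by hand. As a rough sketch for wider data, the same idea can be applied to all columns at once, since expand.grid() also accepts a list of vectors. The name all_combos is only illustrative, and reversing the list is an assumption made here so that the last column varies fastest, which reproduces the row order of ideal_df from the question:

```r
# Example data from the question.
df <- data.frame("Color" = c("Red", "Blue", "Green", "Green"),
                 "Size"  = c("Large", "Large", "Large", "Small"),
                 "Value" = c(0, 1, 1, 1))

# Unique values of every column, as a named list.
unique_vals <- lapply(df, unique)

# expand.grid() varies its first argument fastest, so reverse the list
# to make the last column (Value) vary fastest, then restore the
# original column order.
all_combos <- expand.grid(rev(unique_vals), stringsAsFactors = FALSE)
all_combos <- all_combos[names(df)]

print(all_combos)
```

If row order does not matter, expand.grid(lapply(df, unique)) alone should be enough.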
If performance is a concern, here is a benchmark comparison:
sandy <- function(){
  expand(df, Color, Size, Value)
}
cj <- function(){
  expand.grid(Color = unique(df$Color)
              , Size = unique(df$Size)
              , Value = unique(df$Value))
}
library(microbenchmark)
microbenchmark(sandy(), cj())
Unit: microseconds
    expr      min       lq      mean   median       uq      max neval
 sandy() 1382.524 1494.675 1693.1749 1562.084 1736.524 7352.916   100
    cj()  138.914  152.746  204.8588  173.321  191.910 2889.398   100
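The timings above come from the four-row example, where fixed call overhead probably dominates, so it may be worth repeating the measurement on data closer to the real size. Below is a minimal base R sketch under assumed sizes. The data frame big_df, its level counts, and the row count are all made up for illustration and are not from the question:

```r
# Hypothetical larger input: names and sizes are assumptions for illustration.
set.seed(1)
n <- 1e5
big_df <- data.frame(
  Color = sample(paste0("color", 1:50), n, replace = TRUE),
  Size  = sample(paste0("size", 1:20), n, replace = TRUE),
  Value = sample(0:9, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Time the base R approach once on the larger input.
timing <- system.time(
  big_combos <- expand.grid(lapply(big_df, unique), stringsAsFactors = FALSE)
)
print(timing)

# The result has one row per combination, so its row count is the
# product of the per-column unique counts.
expected_rows <- prod(vapply(big_df, function(x) length(unique(x)), integer(1)))
print(nrow(big_combos) == expected_rows)
```

The output grows multiplicatively with the number of unique values per column, so memory can become the limit before speed does.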