Nesting duplicate variables when joining with dplyr in R
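I am joining data frames (tibbles) that have duplicated columns which I do not want to join by. Below is an example of what I would typically do (joining by i, but not by a or b):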
library(dplyr)
df1 <- tibble(i = letters[1:3], a = 1:3, b = 4:6)
df2 <- tibble(i = letters[1:3], a = 11:13, b = 14:16)
d <- full_join(df1, df2, by ="i")
d
#> # A tibble: 3 × 5
#> i a.x b.x a.y b.y
#> <chr> <int> <int> <int> <int>
#> 1 a 1 4 11 14
#> 2 b 2 5 12 15
#> 3 c 3 6 13 16
I would like these duplicated variables to be returned as nested lists instead, like the output created below:
tibble(
i = letters[1:3],
a = list(c(1, 11), c(2, 12), c(3, 13)),
b = list(c(4, 14), c(5, 15), c(6, 16))
)
#> # A tibble: 3 × 3
#> i a b
#> <chr> <list> <list>
#> 1 a <dbl [2]> <dbl [2]>
#> 2 b <dbl [2]> <dbl [2]>
#> 3 c <dbl [2]> <dbl [2]>
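Is there a simple way to do something like this?
Beyond this, I have been trying (unsuccessfully) various stringr and tidyr approaches. Here is an example that throws an error: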
library(stringr)
library(tidyr)
# Find any variables with .x or .y
dup_var <- d %>% select(matches("\\.[xy]")) %>% names()
# Condense to the stems (original names) of these variables
dup_var_stems <- dup_var %>% str_replace("(\\.[x|y])+", "") %>% unique()
# For each stem, try to nest relevant data into a single variable
for (stem in dup_var_stems) {
  d <- d %>% nest_(key_col = stem, nest_cols = names(d)[str_detect(names(d), paste0(stem, "[$|\\.]"))])
}
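Update
Following the answers from @Sotos and @conor, I should mention that the solution needs to generalise to multiple joins and duplicated columns across many data frames. Below is an example in which five data frames are joined by two columns (i and j). This creates five duplicated versions of columns a and b, plus a number of unique columns c:g. One problem is that joining this many data frames leaves duplicated versions with no suffix at all, alongside those suffixed .x, .x.x, etc. A simple regex match on .x|.y will miss the no-suffix versions of the columns.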
library(dplyr)
library(purrr)
id_cols <- tibble(i = c("x", "x", "y", "y"),
j = c(1, 2, 1, 2))
df1 <- id_cols %>% cbind(tibble(a = 1:4, b = 5:8, c = 21:24))
df2 <- id_cols %>% cbind(tibble(a = 2:5, b = 6:9, d = 31:34))
df3 <- id_cols %>% cbind(tibble(a = 2:5, b = 6:9, e = 31:34))
df4 <- id_cols %>% cbind(tibble(a = 2:5, b = 6:9, f = 31:34))
df5 <- id_cols %>% cbind(tibble(a = 2:5, b = 6:9, g = 31:34))
datalist <- list(df1, df2, df3, df4, df5)
d <- reduce(datalist, full_join, by = c("i", "j"))
d
#> i j a.x b.x c a.y b.y d a.x.x b.x.x e a.y.y b.y.y f a b g
#> 1 x 1 1 5 21 2 6 31 2 6 31 2 6 31 2 6 31
#> 2 x 2 2 6 22 3 7 32 3 7 32 3 7 32 3 7 32
#> 3 y 1 3 7 23 4 8 33 4 8 33 4 8 33 4 8 33
#> 4 y 2 4 8 24 5 9 34 5 9 34 5 9 34 5 9 34
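The suffix problem described in the update can be checked directly on the column names. The following is a minimal base R sketch, not taken from any of the answers: it assumes the column names printed above, and the variable names (nms, stems, stem_pattern) are purely illustrative. It first shows what a suffix-only pattern finds, then builds a pattern from the stems so that the bare a and b are picked up too.

```r
# Column names as they appear in the five-way join printed above
nms <- c("i", "j", "a.x", "b.x", "c", "a.y", "b.y", "d",
         "a.x.x", "b.x.x", "e", "a.y.y", "b.y.y", "f", "a", "b", "g")

# A pattern that only looks for a .x/.y suffix misses the bare "a" and "b"
suffix_only <- grepl("\\.[xy]", nms)
nms[suffix_only]

# Derive the stems from the suffixed names by stripping all trailing suffixes
stems <- unique(sub("(\\.[xy])+$", "", nms[suffix_only]))

# Match a stem followed by zero or more suffixes, anchored at both ends
stem_pattern <- paste0("^(", paste(stems, collapse = "|"), ")(\\.[xy])*$")
dup_cols <- grepl(stem_pattern, nms)
nms[dup_cols]
```

Anchoring the pattern with ^ and $ is a design choice in this sketch: it keeps a stem such as a from matching inside an unrelated column name.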
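Here is one attempt:
library(dplyr)
library(tidyr)
library(reshape2)  # melt() comes from reshape2
# d is the 3 x 5 joined tibble from the first example in the question
melt(d, id.vars = 'i') %>%
  group_by(a = sub('\\..*', '', variable), i) %>%
  summarise(new = list(value)) %>%
  spread(a, new)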
# A tibble: 3 × 3
# i a b
#* <chr> <list> <list>
#1 a <int [2]> <int [2]>
#2 b <int [2]> <int [2]>
#3 c <int [2]> <int [2]>
#With structure
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 3 variables:
$ i: chr "a" "b" "c"
$ a:List of 3
..$ : int 1 11
..$ : int 2 12
..$ : int 3 13
$ b:List of 3
..$ : int 4 14
..$ : int 5 15
..$ : int 6 16
#Or via reshape2 package
library(dplyr)
library(reshape2)
d1 <- melt(d, id.vars = 'i') %>%
  group_by(a = sub('\\..*', '', variable), i) %>%
summarise(new = list(value))
d2 <- dcast(d1, i ~ a, value.var = 'new')
#d2
# i a b
#1 a 1, 11 4, 14
#2 b 2, 12 5, 15
#3 c 3, 13 6, 16
#with structure:
str(d2)
'data.frame': 3 obs. of 3 variables:
$ i: chr "a" "b" "c"
$ a:List of 3
..$ : int 1 11
..$ : int 2 12
..$ : int 3 13
$ b:List of 3
..$ : int 4 14
..$ : int 5 15
..$ : int 6 16
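For readers on a current tidyr, the same melt / group / spread idea can probably be written with pivot_longer() and pivot_wider(). This is a sketch under that assumption, not part of the answer above; it rebuilds the simple two-frame example from the question, and the object name res is illustrative.

```r
library(dplyr)
library(tidyr)

# Rebuild the simple example from the question
df1 <- tibble(i = letters[1:3], a = 1:3, b = 4:6)
df2 <- tibble(i = letters[1:3], a = 11:13, b = 14:16)
d <- full_join(df1, df2, by = "i")

res <- d %>%
  # Long format: one row per (i, original column name)
  pivot_longer(-i, names_to = "variable", values_to = "value") %>%
  # Strip everything from the first dot onwards, leaving the stem
  mutate(variable = sub("\\..*", "", variable)) %>%
  # Wide again; values_fn = list collects the duplicates into list-columns
  pivot_wider(names_from = variable, values_from = value, values_fn = list)

res
```

Here values_fn = list plays the role of summarise(new = list(value)) in the answer: every set of values that shares the same i and stem ends up in one list cell.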
Edit
Following your idea,
library(dplyr)
library(reshape2)
library(purrr)
library(tidyr)
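# Treat every column whose name does not contain "a" or "b" as an id column
# (this works here because the other columns are i, j and c:g)
df <- melt(d, id.vars = c(names(d)[!grepl('a|b', names(d))]))
dots <- names(df)[!grepl('value', names(df))] %>% map(as.symbol)
df %>% mutate(variable = sub('\\..*', '', variable)) %>%
  group_by_(.dots = dots) %>%
  summarise(new = list(value)) %>%
  spread(variable, new) %>%
  ungroup()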
# A tibble: 4 × 9
# i j c d e f g a b
#* <chr> <dbl> <int> <int> <int> <int> <int> <list> <list>
#1 x 1 21 31 31 31 31 <int [5]> <int [5]>
#2 x 2 22 32 32 32 32 <int [5]> <int [5]>
#3 y 1 23 33 33 33 33 <int [5]> <int [5]>
#4 y 2 24 34 34 34 34 <int [5]> <int [5]>
A little more verbose than Sotos's answer, but this works too.
library(dplyr)
library(tidyr)
library(stringr)
d_tidy <- gather(d, col, val, a.x:b.y, -i)
d_tidy$col <- str_replace(d_tidy$col, "\\.[xy]$", "")  # escape the dot so it matches a literal "."
d_tidy %>% group_by(i, col) %>%
summarise(val = list(val)) %>%
spread(col, val) %>%
ungroup()
i a b
<fctr> <list> <list>
1 a <int [2]> <int [2]>
2 b <int [2]> <int [2]>
3 c <int [2]> <int [2]>
If you want to use nest to create lists of data frames, you can do this instead:
d_tidy <- gather(d, col, val, a.x:b.y, -i)
d_tidy$col <- str_replace(d_tidy$col, "\\.[xy]$", "")  # escape the dot so it matches a literal "."
d_tidy %>%
group_by(i, col) %>%
nest(col) %>%
spread(col, data)
i a b
<fctr> <list> <list>
1 a <tbl_df [2,0]> <tbl_df [2,0]>
2 b <tbl_df [2,0]> <tbl_df [2,0]>
3 c <tbl_df [2,0]> <tbl_df [2,0]>
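The output above shows nested tibbles with zero columns, so the values themselves are not carried along. If the goal is for each cell to hold a small data frame that contains the values, one possible sketch with the current tidyr interface is below. It assumes nest(data = ...) and pivot_wider() behave as documented in recent tidyr releases; the object name nested is illustrative.

```r
library(dplyr)
library(tidyr)

# Rebuild the simple example from the question
df1 <- tibble(i = letters[1:3], a = 1:3, b = 4:6)
df2 <- tibble(i = letters[1:3], a = 11:13, b = 14:16)
d <- full_join(df1, df2, by = "i")

nested <- d %>%
  pivot_longer(-i, names_to = "col", values_to = "val") %>%
  # Drop a trailing .x or .y to recover the original column name
  mutate(col = sub("\\.[xy]$", "", col)) %>%
  # Nest the val column: one tibble per combination of i and col
  nest(data = val) %>%
  # One column per stem, each cell holding a one-column tibble
  pivot_wider(names_from = col, values_from = data)

nested$a[[1]]
```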
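After updating the question, I arrived at the following, based on the melt() solution provided by @Sotos (so if you find this works, please upvote that solution as well).
Below is a function that should take a data frame like the one described and nest the duplicated columns. See the comments throughout for an explanation.
Create the data frame from the question: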
library(dplyr)
library(purrr)
id_cols <- tibble(i = c("x", "x", "y", "y"),
j = c(1, 2, 1, 2))
df1 <- id_cols %>% cbind(tibble(a = 1:4, b = 5:8, c = 21:24))
df2 <- id_cols %>% cbind(tibble(a = 2:5, b = 6:9, d = 31:34))
df3 <- id_cols %>% cbind(tibble(a = 2:5, b = 6:9, e = 31:34))
df4 <- id_cols %>% cbind(tibble(a = 2:5, b = 6:9, f = 31:34))
df5 <- id_cols %>% cbind(tibble(a = 2:5, b = 6:9, g = 31:34))
datalist <- list(df1, df2, df3, df4, df5)
d <- reduce(datalist, full_join, by = c("i", "j"))
d
#> i j a.x b.x c a.y b.y d a.x.x b.x.x e a.y.y b.y.y f a b g
#> 1 x 1 1 5 21 2 6 31 2 6 31 2 6 31 2 6 31
#> 2 x 2 2 6 22 3 7 32 3 7 32 3 7 32 3 7 32
#> 3 y 1 3 7 23 4 8 33 4 8 33 4 8 33 4 8 33
#> 4 y 2 4 8 24 5 9 34 5 9 34 5 9 34 5 9 34
Create the function nest_duplicated():
# Function to nest duplicated columns after joining multiple data frames
#
# Args:
#   df        Data frame of joined data frames with duplicated columns.
#   suffixes  Character string (regex) to match suffixes. E.g., the default
#             "\\.[xy]" finds any columns containing .x or .y
#
# Depends on: dplyr, tidyr, purrr, stringr
nest_duplicated <- function(df, suffixes = "\\.[xy]") {
  # Search string to match any duplicated variables: the stems of all suffixed
  # columns, followed by either the end of the name or a suffix. The leading ^
  # keeps a stem from matching the tail of an unrelated column name.
  search_string <- df %>%
    dplyr::select(dplyr::matches(suffixes)) %>%
    names() %>%
    stringr::str_replace_all(suffixes, "") %>%
    unique() %>%
    stringr::str_c(collapse = "|") %>%
    stringr::str_c("^(", ., ")($|", suffixes, ")")
  # Gather duplicated variables and convert names to stems
  df <- df %>%
    tidyr::gather(variable, value, dplyr::matches(search_string)) %>%
    dplyr::mutate(variable = stringr::str_replace_all(variable, suffixes, ""))
  # Group by all columns except value to convert duplicated rows into list, then
  # spread by variable (var).
  # Note: group_by_() is deprecated in current dplyr releases.
  dots <- names(df)[!stringr::str_detect(names(df), "value")] %>% purrr::map(as.symbol)
  df %>%
    dplyr::group_by_(.dots = dots) %>%
    dplyr::summarise(new = list(value)) %>%
    tidyr::spread(variable, new) %>%
    dplyr::ungroup()
}
Apply nest_duplicated():
nest_duplicated(d)
#> # A tibble: 4 × 9
#> i j c d e f g a b
#> * <chr> <dbl> <int> <int> <int> <int> <int> <list> <list>
#> 1 x 1 21 31 31 31 31 <int [5]> <int [5]>
#> 2 x 2 22 32 32 32 32 <int [5]> <int [5]>
#> 3 y 1 23 33 33 33 33 <int [5]> <int [5]>
#> 4 y 2 24 34 34 34 34 <int [5]> <int [5]>
Updates/improvements welcome!
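As one possible update, the same approach can be sketched with pivot_longer() and pivot_wider() in place of gather(), group_by_() and spread(). This is an assumption-laden variant rather than a drop-in replacement: the name nest_duplicated_pivot is hypothetical, the suffix argument here is a regex that strips all trailing .x/.y suffixes at once, and it assumes the duplicated columns share a common type.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical helper, not part of any package
nest_duplicated_pivot <- function(df, suffix = "(\\.[xy])+$") {
  nms <- names(df)
  # Stems of every column that carries at least one suffix
  stems <- unique(sub(suffix, "", nms[grepl(suffix, nms)]))
  # A column is a duplicate if its stem is one of those, suffixed or not
  is_dup <- sub(suffix, "", nms) %in% stems
  df %>%
    pivot_longer(all_of(nms[is_dup]),
                 names_to = "variable", values_to = "value") %>%
    mutate(variable = sub(suffix, "", variable)) %>%
    # Collect all values sharing the same id columns and stem into a list cell
    pivot_wider(names_from = variable, values_from = value, values_fn = list)
}

# Same data as in the question's update
id_cols <- tibble(i = c("x", "x", "y", "y"),
                  j = c(1, 2, 1, 2))
df1 <- id_cols %>% cbind(tibble(a = 1:4, b = 5:8, c = 21:24))
df2 <- id_cols %>% cbind(tibble(a = 2:5, b = 6:9, d = 31:34))
df3 <- id_cols %>% cbind(tibble(a = 2:5, b = 6:9, e = 31:34))
df4 <- id_cols %>% cbind(tibble(a = 2:5, b = 6:9, f = 31:34))
df5 <- id_cols %>% cbind(tibble(a = 2:5, b = 6:9, g = 31:34))
d <- reduce(list(df1, df2, df3, df4, df5), full_join, by = c("i", "j"))

res <- nest_duplicated_pivot(d)
res
```

Because the duplicates are identified by comparing stems instead of by a single combined regex, the unsuffixed a and b are included without any special handling.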