In R, how to combine two data frames when multiple columns need to be matched?

Question

Update 2022-04-18

I changed my data frames to look more like the real ones.

"NA" and "-3" both denote missing values.


> dfA
# A tibble: 6 × 5
  city  name  bloodtype pulse20 pulse21
  <chr> <chr> <chr>       <dbl>   <dbl>
1 CityA Angel A              78      79
2 CityA Bob   B              90      91
3 CityB Cathy A              60      64
4 CityB Dean  B              70      71
5 CityC Ellen O              60      -3
6 CityC Faye  AB             75      -3


> dfB
# A tibble: 3 × 4
  city  name  bloodtype pulse21
  <chr> <chr> <chr>       <dbl>
1 CityC Ellen O              65
2 CityC Faye  AB             79
3 CityC Gaven O              68

I used a join to merge them into df_joined:


library(dplyr)
df_joined <- 
  dfA %>% 
  full_join(dfB, by = c("city", "name"), suffix = c("", "_repla"))
df_joined
# "repla" stands for "replacement"


> df_joined
# A tibble: 7 × 7
  city  name  bloodtype pulse20 pulse21 bloodtype_repla pulse21_repla
  <chr> <chr> <chr>       <dbl>   <dbl> <chr>                   <dbl>
1 CityA Angel A              78      79 NA                         NA
2 CityA Bob   B              90      91 NA                         NA
3 CityB Cathy A              60      64 NA                         NA
4 CityB Dean  B              70      71 NA                         NA
5 CityC Ellen O              60      -3 O                          65
6 CityC Faye  AB             75      -3 AB                         79
7 CityC Gaven NA             NA      NA O                          68

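I could change them one by one, but my real data has many more "_repla" columns, 100+ for example.

So what is an efficient way to match similar column names and mutate them, for example to fill all the new data from "formercolumnname_repla" into "formercolumnname"?

I looked at the help page for across(), but I still don't quite understand how to put this together cleanly. Thanks for your help ^^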

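Original question (2022-04-17)

I have 2 data frames.

dfA is a large dataset that includes all cities and all health data for 2020-2021, except that the 2021 health data for CityC is marked as "-3".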

dfA

City  Name  Pulse20 Pulse21
CityA Amy        77      78
CityB Bob        80      79
CityC Cathy      79      -3

dfB is a small data frame containing the data I want to fill into dfA.

dfB

City  Name  Pulse21
CityC Cathy      80

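Requests:

1. How can I combine these two data frames in a general way?

2. If I use "full_join", Pulse21 is listed as "Pulse21.x" and "Pulse21.y", so I need to do more work to combine them.

3. For the record, in my real data each city has 500+ people, and there are 100 or more health variables.

So is there any way to make this simpler and more efficient? Thanks a lot!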

Answer 1

dplyr::rows_update(dfA, dfB, c('City', 'Name'))

   City  Name Pulse20 Pulse21
1 CityA   Amy      77      78
2 CityB   Bob      80      79
3 CityC Cathy      79      80

Edit:

For your new data, it seems you are inserting new rows while also updating the original ones. You can use rows_upsert, i.e. update + insert:

dplyr::rows_upsert(dfA, dfB, c('city', 'name'))

# A tibble: 7 x 5
  city  name  bloodtype pulse20 pulse21
* <chr> <chr> <chr>       <int>   <dbl>
1 CityA Angel A              78      79
2 CityA Bob   B              90      91
3 CityB Cathy A              60      64
4 CityB Dean  B              70      71
5 CityC Ellen O              60      65
6 CityC Faye  AB             75      79
7 CityC Gaven O              NA      68
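To illustrate why rows_upsert() rather than rows_update() fits the updated data, here is a minimal sketch that runs both on the question's dfA and dfB. It assumes a dplyr version that ships the rows_*() functions (1.0.0 or later); the names keys, update_result and upserted are only illustrative.

```r
library(dplyr)

dfA <- tibble::tribble(
  ~city,   ~name,   ~bloodtype, ~pulse20, ~pulse21,
  "CityA", "Angel", "A",        78,       79,
  "CityA", "Bob",   "B",        90,       91,
  "CityB", "Cathy", "A",        60,       64,
  "CityB", "Dean",  "B",        70,       71,
  "CityC", "Ellen", "O",        60,       -3,
  "CityC", "Faye",  "AB",       75,       -3
)

dfB <- tibble::tribble(
  ~city,   ~name,   ~bloodtype, ~pulse21,
  "CityC", "Ellen", "O",        65,
  "CityC", "Faye",  "AB",       79,
  "CityC", "Gaven", "O",        68
)

keys <- c("city", "name")

# rows_update() only modifies rows that already exist in dfA.
# Gaven's key is missing from dfA, so the call is expected to fail;
# tryCatch() captures the error instead of stopping the script.
update_result <- tryCatch(
  rows_update(dfA, dfB, by = keys),
  error = function(e) e
)
inherits(update_result, "error")

# rows_upsert() updates the matching rows and appends the unmatched one.
upserted <- rows_upsert(dfA, dfB, by = keys)
nrow(upserted)
```

With the smaller data from the original question, every key in dfB already exists in dfA, which is why rows_update() was sufficient there.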

Answer 2

There are several ways to do this.

Most of them rely on first changing the -3 values to NA, e.g. via mutate(across(where(is.numeric), ~ifelse(.x == -3, NA_real_, .x))).
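The recoding step can be tried in isolation. The sketch below uses dplyr::na_if() as a shorter alternative to the ifelse() call; the small pulse tibble is made up for illustration and is not part of the question's data.

```r
library(dplyr)

# Hypothetical two-row sample where -3 marks a missing value
pulse <- tibble::tibble(
  name    = c("Ellen", "Faye"),
  pulse20 = c(60, 75),
  pulse21 = c(-3, -3)
)

# Replace -3 with NA in every numeric column; other values are untouched
pulse_na <- pulse %>%
  mutate(across(where(is.numeric), ~na_if(.x, -3)))

pulse_na
```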

Below are some examples.

`dplyr::rows_upsert`

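By far the easiest way is to use dplyr::rows_upsert(). It updates the rows of dfA whose keys also appear in dfB with the values from dfB, and inserts the rows that do not exist in dfA.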

library(dplyr)

dfA %>% 
  mutate(across(where(is.numeric), ~ifelse(.x == -3, NA_real_, .x))) %>% 
  dplyr::rows_upsert(dfB, by = c("city", "name"))
#> # A tibble: 7 × 5
#>   city  name  bloodtype pulse20 pulse21
#>   <chr> <chr> <chr>       <dbl>   <dbl>
#> 1 CityA Angel A              78      79
#> 2 CityA Bob   B              90      91
#> 3 CityB Cathy A              60      64
#> 4 CityB Dean  B              70      71
#> 5 CityC Ellen O              60      65
#> 6 CityC Faye  AB             75      79
#> 7 CityC Gaven O              NA      68

Created on 2022-04-18 by the reprex package (v2.0.1)

Note that this function is still experimental, and might change with future updates of dplyr.
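If values that are already present in the first data frame must never be overwritten, dplyr::rows_patch() may be a closer fit, since it only fills NA cells. This is a different function from the one used above, shown here as a sketch on made-up data (x and y are illustrative names). The semi_join() step is an assumption on my part: it drops the rows of y whose keys are absent from x, because rows_patch() by default expects every key in y to exist in x.

```r
library(dplyr)

# Hypothetical data: Dean already has a 2021 value, Ellen and Faye do not
x <- tibble::tibble(
  city    = c("CityB", "CityC", "CityC"),
  name    = c("Dean", "Ellen", "Faye"),
  pulse21 = c(71, NA, NA)
)

y <- tibble::tibble(
  city    = c("CityB", "CityC", "CityC", "CityC"),
  name    = c("Dean", "Ellen", "Faye", "Gaven"),
  pulse21 = c(99, 65, 79, 68)
)

keys <- c("city", "name")

# Keep only the rows of y whose key exists in x (drops Gaven)
y_matched <- semi_join(y, x, by = keys)

# Fill NA cells in x from y; Dean's existing value is left alone
patched <- rows_patch(x, y_matched, by = keys)

patched
```

Unlike rows_upsert(), this does not add the unmatched row, so it only covers the "fill the gaps" half of the task.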

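`full_join()` and `pivot_longer()`

If we first join the two data frames with different suffixes, we can let tidyr::pivot_longer() merge them for us. This first creates a long data frame containing the combinations of dfA and dfB, but na.omit() ensures that we keep only the rows that contain no missing values: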

library(dplyr)

dfA %>% 
  full_join(dfB, by = c("city", "name"), suffix = c("_A", "_B")) %>% 
  mutate(across(where(is.numeric), ~ifelse(.x == -3, NA_real_, .x))) %>%
  tidyr::pivot_longer(
    ends_with(c("_A", "_B")), 
    names_to = ".value", 
    names_pattern = "(.*)_.*",
    values_drop_na = TRUE
  ) %>% 
  na.omit()
#> # A tibble: 6 × 5
#>   city  name  pulse20 bloodtype pulse21
#>   <chr> <chr>   <dbl> <chr>       <dbl>
#> 1 CityA Angel      78 A              79
#> 2 CityA Bob        90 B              91
#> 3 CityB Cathy      60 A              64
#> 4 CityB Dean       70 B              71
#> 5 CityC Ellen      60 O              65
#> 6 CityC Faye       75 AB             79


Note that this solution removes the rows which only exist in dfB.

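`bind_rows()` and `summarise()`

First, using bind_rows() we can combine the two dfs row-wise. By grouping on the id columns, we can then summarise with e.g. median(na.rm = TRUE), which removes the missing values for us: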

library(dplyr)

dfA %>% 
  bind_rows(dfB) %>% 
  mutate(across(where(is.numeric), ~ifelse(.x == -3, NA_real_, .x))) %>% 
  group_by(city, name, bloodtype) %>% 
  summarise(
    across(
      everything(),
      ~median(.x, na.rm = TRUE)
    )
  ) %>% 
  ungroup()
#> `summarise()` has grouped output by 'city', 'name'. You can override using the
#> `.groups` argument.
#> # A tibble: 7 × 5
#>   city  name  bloodtype pulse20 pulse21
#>   <chr> <chr> <chr>       <dbl>   <dbl>
#> 1 CityA Angel A              78      79
#> 2 CityA Bob   B              90      91
#> 3 CityB Cathy A              60      64
#> 4 CityB Dean  B              70      71
#> 5 CityC Ellen O              60      65
#> 6 CityC Faye  AB             75      79
#> 7 CityC Gaven O              NA      68

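A variation on the same idea, as a sketch: group only by the key columns and take the first non-missing value of every other column. This is my own adaptation rather than part of the answer above. It avoids median(), so it also works for character columns such as bloodtype, and a mismatch in bloodtype between dfA and dfB would not split a person into two rows. The name combined is illustrative.

```r
library(dplyr)

dfA <- tibble::tribble(
  ~city,   ~name,   ~bloodtype, ~pulse20, ~pulse21,
  "CityA", "Angel", "A",        78,       79,
  "CityA", "Bob",   "B",        90,       91,
  "CityB", "Cathy", "A",        60,       64,
  "CityB", "Dean",  "B",        70,       71,
  "CityC", "Ellen", "O",        60,       -3,
  "CityC", "Faye",  "AB",       75,       -3
)

dfB <- tibble::tribble(
  ~city,   ~name,   ~bloodtype, ~pulse21,
  "CityC", "Ellen", "O",        65,
  "CityC", "Faye",  "AB",       79,
  "CityC", "Gaven", "O",        68
)

combined <- dfA %>%
  bind_rows(dfB) %>%
  # Treat -3 as missing before collapsing the rows
  mutate(across(where(is.numeric), ~na_if(.x, -3))) %>%
  group_by(city, name) %>%
  summarise(
    # First non-missing value per column; NA if every value is missing
    across(everything(), ~.x[!is.na(.x)][1]),
    .groups = "drop"
  )

combined
```

Because dfA comes first in bind_rows(), a value from dfA wins whenever both data frames have a non-missing value for the same cell.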

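Original answer

As in my original answer, you can use full_join() and mutate() to fix the NA and -3 problem. However, when you have many columns, this is more difficult than the solutions mentioned above.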
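For completeness, here is one way the full_join() and mutate() route could be scaled to many "_repla" columns with across(), as asked in the question. It is a sketch, not the only way to write it: the helper names repla_cols, base_cols and df_filled are illustrative, and the rule "take the replacement when it is not NA, otherwise keep the old value" is my assumption about the intended behaviour.

```r
library(dplyr)

dfA <- tibble::tribble(
  ~city,   ~name,   ~bloodtype, ~pulse20, ~pulse21,
  "CityA", "Angel", "A",        78,       79,
  "CityA", "Bob",   "B",        90,       91,
  "CityB", "Cathy", "A",        60,       64,
  "CityB", "Dean",  "B",        70,       71,
  "CityC", "Ellen", "O",        60,       -3,
  "CityC", "Faye",  "AB",       75,       -3
)

dfB <- tibble::tribble(
  ~city,   ~name,   ~bloodtype, ~pulse21,
  "CityC", "Ellen", "O",        65,
  "CityC", "Faye",  "AB",       79,
  "CityC", "Gaven", "O",        68
)

df_joined <- dfA %>%
  full_join(dfB, by = c("city", "name"), suffix = c("", "_repla"))

# Every column ending in "_repla" and the column it is meant to replace
repla_cols <- grep("_repla$", names(df_joined), value = TRUE)
base_cols  <- sub("_repla$", "", repla_cols)

df_filled <- df_joined %>%
  mutate(across(
    all_of(base_cols),
    # Look up the partner column by name; prefer it when it is not NA
    ~coalesce(.data[[paste0(cur_column(), "_repla")]], .x)
  )) %>%
  select(-all_of(repla_cols))

df_filled
```

A -3 that has no replacement in dfB is left as -3 by this sketch, so the -3 to NA recoding would still be needed if such cells should end up as NA.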

Data

dfA <- tibble::tribble(
  ~city,  ~name,  ~bloodtype, ~pulse20, ~pulse21,
  "CityA", "Angel", "A",              78,      79,
  "CityA", "Bob",   "B",              90,      91,
  "CityB", "Cathy", "A",              60,      64,
  "CityB", "Dean",  "B",              70,      71,
  "CityC", "Ellen", "O",              60,      -3,
  "CityC", "Faye",  "AB",             75,      -3
)

dfB <- tibble::tribble(
  ~city,  ~name,  ~bloodtype, ~pulse21,
  "CityC", "Ellen", "O",              65,
  "CityC", "Faye",  "AB",             79,
  "CityC", "Gaven", "O",              68
)
