Reordering columns using common names - dplyr

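My data comes from a database, and depending on when I run my SQL query, it may contain different POS values from one week to the next.

Not knowing in advance which values the variable will contain makes it very difficult to automate report creation.

My data looks like this: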

sample <- data.frame(DRUG = c("A","A","B"),POS = c("Hospital","Physician","Home"),GROSS_COST = c(50,100,60), NET_COST = c(45,80,40))

I need to pivot this data frame wider so that each point of sale (POS) gets one column per cost type (gross and net).

This is easy to do with pivot_wider:

x <- sample %>% pivot_wider(names_from = POS, values_from = c(GROSS_COST,NET_COST))

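Objective: I would like to keep the columns for each POS together, i.e. GROSS_COST_Hospital and NET_COST_Hospital should sit side by side, and likewise for every other POS.

Is there an elegant way to group the columns using string matching?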

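Unfortunately, I don't think there is a direct solution (yet!). See https://github.com/tidyverse/tidyr/issues/839

For now, you can reshape the data to long format first, which lets you control the column order the way you want.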

library(tidyr)

sample %>%
  pivot_longer(cols = c(GROSS_COST, NET_COST)) %>%
  pivot_wider(names_from = c(name, POS), values_from = value)

#   DRUG  GROSS_COST_Hosp… NET_COST_Hospit… GROSS_COST_Phys… NET_COST_Physic…
#  <chr>            <dbl>            <dbl>            <dbl>            <dbl>
#1 A                   50               45              100               80
#2 B                   NA               NA               NA               NA
# … with 2 more variables: GROSS_COST_Home <dbl>, NET_COST_Home <dbl>
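
The long-then-wide approach works because pivot_wider creates columns in the order in which the name/POS combinations first appear in the long data. As a minimal sketch, assuming alphabetical POS order is what the report needs, sorting the long data before widening makes the column order independent of the order in which rows come back from the database (the arrange step is an addition for illustration, not part of the answer above):

```r
library(dplyr)
library(tidyr)

sample <- data.frame(
  DRUG = c("A", "A", "B"),
  POS = c("Hospital", "Physician", "Home"),
  GROSS_COST = c(50, 100, 60),
  NET_COST = c(45, 80, 40)
)

wide_sorted <- sample %>%
  pivot_longer(cols = c(GROSS_COST, NET_COST)) %>%
  # Assumption: alphabetical POS order is wanted. Ties keep their
  # original order, so GROSS_COST stays ahead of NET_COST within a POS.
  arrange(POS) %>%
  pivot_wider(names_from = c(name, POS), values_from = value)

names(wide_sorted)
```

With this sample data the Home pair should come first, followed by the Hospital and Physician pairs.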

We can do the ordering in the select step:

library(dplyr)
library(tidyr)
library(stringr)
sample %>% 
  pivot_wider(names_from = POS, values_from = c(GROSS_COST,NET_COST)) %>% 
  select(DRUG, names(.)[-1][order(str_extract(names(.)[-1], '[^_]+$'))])
# A tibble: 2 x 7
#  DRUG  GROSS_COST_Home NET_COST_Home GROSS_COST_Hospital NET_COST_Hospital GROSS_COST_Physician NET_COST_Physician
#  <chr>           <dbl>         <dbl>               <dbl>             <dbl>                <dbl>              <dbl>
#1 A                  NA            NA                  50                45                  100                 80
#2 B                  60            40                  NA                NA                   NA                 NA
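
One caveat with the pattern '[^_]+$': it keeps only the text after the last underscore, so a POS value that itself contains an underscore (a hypothetical "Home_Care", say) would be sorted by "Care" alone. A possible variation, sketched here on the assumption that the value columns are always GROSS_COST and NET_COST, strips those known prefixes instead:

```r
library(dplyr)
library(tidyr)

sample <- data.frame(
  DRUG = c("A", "A", "B"),
  POS = c("Hospital", "Physician", "Home"),
  GROSS_COST = c(50, 100, 60),
  NET_COST = c(45, 80, 40)
)

wide <- sample %>%
  pivot_wider(names_from = POS, values_from = c(GROSS_COST, NET_COST))

# Every column except the id column
value_cols <- setdiff(names(wide), "DRUG")

# Assumption: the only value prefixes are GROSS_COST and NET_COST.
# Removing them leaves the full POS label, underscores included.
pos_part <- sub("^(GROSS_COST|NET_COST)_", "", value_cols)

# order() leaves ties in their original order, so GROSS_COST stays
# ahead of NET_COST within each POS.
reordered <- wide %>%
  select(DRUG, all_of(value_cols[order(pos_part)]))

names(reordered)
```

For the sample data this should give the same column order as the str_extract version.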

A data.table option using dcast + melt

> dcast(melt(setDT(sample), id.vars = c("DRUG", "POS")), DRUG ~ variable + POS)
   DRUG GROSS_COST_Home GROSS_COST_Hospital GROSS_COST_Physician NET_COST_Home
1:    A              NA                  50                  100            NA
2:    B              60                  NA                   NA            40
   NET_COST_Hospital NET_COST_Physician
1:                45                 80
2:                NA                 NA
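
Note that the dcast call above sorts the columns by cost type first and POS second, so the GROSS_COST columns still end up apart from the NET_COST columns. As a sketch of one way to group by POS instead, the two terms on the right-hand side of the formula can be swapped. The trade-off is that the column names then start with the POS value (Home_GROSS_COST rather than GROSS_COST_Home). The copy via as.data.table and the explicit value.var are choices made here for illustration:

```r
library(data.table)

sample <- data.frame(
  DRUG = c("A", "A", "B"),
  POS = c("Hospital", "Physician", "Home"),
  GROSS_COST = c(50, 100, 60),
  NET_COST = c(45, 80, 40)
)

# as.data.table() makes a copy, so the original data frame is left untouched
long <- melt(as.data.table(sample), id.vars = c("DRUG", "POS"))

# POS comes before variable on the right-hand side, so columns are
# grouped by POS first and by cost type second
wide_dt <- dcast(long, DRUG ~ POS + variable, value.var = "value")

names(wide_dt)
```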

With the release of tidyr 1.2.0 this problem is finally solved: you can use the names_vary argument directly.

library(tidyr)
sample <- data.frame(DRUG = c("A","A","B"),POS = c("Hospital","Physician","Home"),GROSS_COST = c(50,100,60), NET_COST = c(45,80,40))

sample %>% 
  pivot_wider(names_from = POS, values_from = c(GROSS_COST,NET_COST), names_vary = 'slowest')
#> # A tibble: 2 x 7
#>   DRUG  GROSS_COST_Hospital NET_COST_Hospital GROSS_COST_Physi~ NET_COST_Physic~
#>   <chr>               <dbl>             <dbl>             <dbl>            <dbl>
#> 1 A                      50                45               100               80
#> 2 B                      NA                NA                NA               NA
#> # ... with 2 more variables: GROSS_COST_Home <dbl>, NET_COST_Home <dbl>

Created on 2022-02-18 by the reprex package (v2.0.1)
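
To make the effect of names_vary concrete, the sketch below compares the default, "fastest", with "slowest" on the same sample data. With "fastest" the names_from values vary fastest, so all GROSS_COST columns come before all NET_COST columns. With "slowest" each POS value keeps its cost columns together. The version check is an addition here, since names_vary needs tidyr 1.2.0 or later:

```r
library(tidyr)

# names_vary was introduced in tidyr 1.2.0
stopifnot(packageVersion("tidyr") >= "1.2.0")

sample <- data.frame(
  DRUG = c("A", "A", "B"),
  POS = c("Hospital", "Physician", "Home"),
  GROSS_COST = c(50, 100, 60),
  NET_COST = c(45, 80, 40)
)

# Default behaviour: columns grouped by cost type
fastest <- pivot_wider(
  sample,
  names_from = POS,
  values_from = c(GROSS_COST, NET_COST),
  names_vary = "fastest"
)

# Columns grouped by POS, in order of first appearance in the data
slowest <- pivot_wider(
  sample,
  names_from = POS,
  values_from = c(GROSS_COST, NET_COST),
  names_vary = "slowest"
)

names(fastest)
names(slowest)
```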