Writing an R function that finds the difference between two data frames

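My problem is basically this: I have two datasets, Booking Data (BD) and Order Data (OD). BD holds orders that were booked but not paid for/confirmed, while OD only has records of confirmed orders (which is why they have an Order_ID). So BD has more records than OD.

The two have roughly the same structure, except that the OD table has an additional column called Order_ID. I have created a new table from both data frames, as shown below: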

OD
Order_ID Departure_date      created_at_date     TF_PP
     <dbl> <dttm>              <dttm>              <dbl>
1   792251 2021-07-17 00:00:00 2021-07-02 00:00:00  9045
2   792563 2021-07-17 00:00:00 2021-07-02 00:00:00  9045
3   794073 2021-07-17 00:00:00 2021-07-03 00:00:00  7524
4   795797 2021-07-17 00:00:00 2021-07-03 00:00:00  9045
5   796617 2021-07-17 00:00:00 2021-07-04 00:00:00  9045
6   797848 2021-07-17 00:00:00 2021-07-04 00:00:00  9045
BD       
Departure_date         created_at_date     TF_PP
1:     2021-07-17      2021-07-02          9045
2:     2021-07-17      2021-07-02          9045
3:     2021-07-17      2021-07-02          9045
4:     2021-07-17      2021-07-03          9045
5:     2021-07-17      2021-07-03          7524
6:     2021-07-17      2021-07-03          9045
7:     2021-07-17      2021-07-03          9045
8:     2021-07-17      2021-07-04          5142
9:     2021-07-17      2021-07-04          9045
10:    2021-07-17      2021-07-04          10000

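Problem

I would like to write a function that, for EACH Order_ID in OD, takes the created_at_date and the corresponding TF_PP value and finds the rows in BD whose TF_PP is more than 2000 lower than the OD TF_PP AND whose created_at_date is after the OD created_at_date (as in the sample output below) AND which share the same Departure_date as the Order_ID in OD. The function should then return a new data frame containing the difference between the two (difference > 2000) along with the Order_ID, Departure_Date and created_at_date columns. If diff < 2000, the function should not return that row in the new data frame.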

Output:
Order_ID    Dep_Date    OD(Created_at_Date) TF  BD(Created_at_Date) TF   Diff
766787      2021-07-17   2021-07-02        9040     2021-07-04     6950  2090
766787      2021-07-17   2021-07-02        9040     2021-07-12     6895  2145
839265      2021-08-20   2021-08-08        12987    2021-08-15     10000 2987

Let me know if my question is confusing or needs further clarification.

Edit: dput output

OD:


structure(list(Order_ID = c(792251, 792563, 794073, 795797, 796617, 
797848, 798374, 798990, 801121, 801643, 808494, 809900, 810710, 
814812, 815040, 815257, 817469, 819219), Departure_date = structure(c(1626480000, 
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 
1626480000, 1626480000, 1626480000, 1626480000, 1626480000), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), created_at_date = structure(c(1625184000, 1625184000, 
1625270400, 1625270400, 1625356800, 1625356800, 1625356800, 1625443200, 
1625443200, 1625529600, 1625702400, 1625788800, 1625788800, 1625961600, 
1625961600, 1625961600, 1626048000, 1626048000), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), TF_PP = c(9045, 9045, 7524, 9045, 9045, 9045, 9045, 
11245, 9045, 11245, 12945, 12945, 12945, 12945, 12945, 12945, 
14945, 14945)), row.names = c(NA, -18L), groups = structure(list(
    Order_ID = c(792251, 792563, 794073, 795797, 796617, 797848, 
    798374, 798990, 801121, 801643, 808494, 809900, 810710, 814812, 
    815040, 815257, 817469, 819219), .rows = structure(list(1L, 
        2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 
        15L, 16L, 17L, 18L), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -18L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

BD:


structure(list(Departure_date = structure(c(1626480000, 1626480000, 
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 
1626480000, 1626480000, 1626480000, 1626480000), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), created_at_date = structure(c(1625184000, 1625184000, 
1625184000, 1625270400, 1625270400, 1625270400, 1625270400, 1625356800, 
1625356800, 1625356800, 1625356800, 1625356800, 1625356800, 1625356800, 
1625356800, 1625443200, 1625443200, 1625443200, 1625529600, 1625529600, 
1625529600, 1625529600, 1625529600, 1625529600, 1625702400, 1625702400, 
1625702400, 1625702400, 1625702400, 1625702400), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), TF_PP = c(9045, 9045, 9045, 9045, 7524, 9045, 9045, 
9045, 9045, 9045, 9045, 9045, 9045, 9045, 9045, 11245, 9045, 
9045, 11245, 11245, 11245, 11245, 11245, 11245, 11245, 11245, 
11245, 12945, 12945, 12945)), row.names = c(NA, -30L), class = c("data.table", 
"data.frame"))

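Maybe something like this, using plain SQL:

library(sqldf)

# Cross join OD and BD, then keep the pairs with the same departure date,
# an OD creation date later than the BD one, and a TF_PP gap above 2000
sqldf("
SELECT *, OD.TF_PP - BD.TF_PP AS diff
FROM OD, BD
WHERE (OD.created_at_date > BD.created_at_date)
  AND (OD.TF_PP - BD.TF_PP > 2000)
  AND (BD.Departure_date = OD.Departure_date)")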

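Although this question is tagged {dplyr}, this kind of non-equi join is easier to do with {data.table}:

library(data.table)

OD <- as.data.table(OD)
BD <- as.data.table(BD)

# For each BD row, match the OD rows with the same departure date
# and an earlier created_at_date, then keep differences above 2000
OD[BD, 
     on = .(created_at_date < created_at_date, Departure_date = Departure_date)
   ][
     , diff := i.TF_PP - TF_PP   # BD TF_PP (prefix i.) minus OD TF_PP
   ][
     diff > 2000]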

#>      Order_ID Departure_date created_at_date TF_PP i.TF_PP diff
#>   1:   792251     2021-07-17      2021-07-05  9045   11245 2200
#>   2:   792563     2021-07-17      2021-07-05  9045   11245 2200
#>   3:   794073     2021-07-17      2021-07-05  7524   11245 3721
#>   4:   795797     2021-07-17      2021-07-05  9045   11245 2200
#>   5:   796617     2021-07-17      2021-07-05  9045   11245 2200
#>  ---                                                           
#>  99:   795797     2021-07-17      2021-07-08  9045   12945 3900
#> 100:   796617     2021-07-17      2021-07-08  9045   12945 3900
#> 101:   797848     2021-07-17      2021-07-08  9045   12945 3900
#> 102:   798374     2021-07-17      2021-07-08  9045   12945 3900
#> 103:   801121     2021-07-17      2021-07-08  9045   12945 3900

Created on 2021-08-27 by the reprex package (v2.0.1)

For {dplyr}, we would use full_join, then create diff, and then filter:

library(dplyr)

OD %>% 
  full_join(BD, by = "Departure_date") %>% 
  mutate(diff = TF_PP.y - TF_PP.x) %>% 
  filter(created_at_date.x < created_at_date.y,
         diff > 2000)

#> # A tibble: 103 x 7
#> # Groups:   Order_ID [8]
#>    Order_ID Departure_date      created_at_date.x   TF_PP.x created_at_date.y  
#>       <dbl> <dttm>              <dttm>                <dbl> <dttm>             
#>  1   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-05 00:00:00
#>  2   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-06 00:00:00
#>  3   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-06 00:00:00
#>  4   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-06 00:00:00
#>  5   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-06 00:00:00
#>  6   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-06 00:00:00
#>  7   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-06 00:00:00
#>  8   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-08 00:00:00
#>  9   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-08 00:00:00
#> 10   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-08 00:00:00
#> # ... with 93 more rows, and 2 more variables: TF_PP.y <dbl>, diff <dbl>

Created on 2021-08-27 by the reprex package (v2.0.1)
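
As a possible alternative, dplyr 1.1.0 and later can express the non-equi condition directly in the join through join_by(), which avoids building every pair with the same Departure_date before filtering. The sketch below is only an illustration under assumptions: the toy OD and BD tables are made up to resemble the question's data (they are not the full dput), and diff keeps the same BD-minus-OD direction as the code above.

```r
library(dplyr)

# Toy data modelled on the question (illustrative values, not the full dput)
OD <- tibble(
  Order_ID        = c(792251, 794073),
  Departure_date  = as.Date(c("2021-07-17", "2021-07-17")),
  created_at_date = as.Date(c("2021-07-02", "2021-07-03")),
  TF_PP           = c(9045, 7524)
)
BD <- tibble(
  Departure_date  = as.Date(rep("2021-07-17", 3)),
  created_at_date = as.Date(c("2021-07-02", "2021-07-05", "2021-07-08")),
  TF_PP           = c(9045, 11245, 12945)
)

res <- OD %>%
  ungroup() %>%   # the OD in the question is a grouped_df; harmless here
  inner_join(
    BD,
    # same departure date, and the OD row created before the BD row
    by = join_by(Departure_date, x$created_at_date < y$created_at_date)
  ) %>%
  mutate(diff = TF_PP.y - TF_PP.x) %>%   # BD TF_PP minus OD TF_PP
  filter(diff > 2000)

res
```

Because the inequality keys share a name, both are kept and get the usual .x (OD) and .y (BD) suffixes, so the later steps look the same as in the full_join version.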

As pointed out in the comments, the {data.table} approach only keeps the BD created_at_date. To keep the created_at_date of the OD table as well, we first need to assign it to a different name:

OD[, "create_at_date_OD" := created_at_date
][BD,  
   on = .(created_at_date < created_at_date, Departure_date = Departure_date)
][
  ,`:=`("diff" = i.TF_PP - TF_PP)
][
  diff > 2000]

#>      Order_ID Departure_date created_at_date TF_PP create_at_date_OD i.TF_PP
#>   1:   792251     2021-07-17      2021-07-05  9045        2021-07-02   11245
#>   2:   792563     2021-07-17      2021-07-05  9045        2021-07-02   11245
#>   3:   794073     2021-07-17      2021-07-05  7524        2021-07-03   11245
#>   4:   795797     2021-07-17      2021-07-05  9045        2021-07-03   11245
#>   5:   796617     2021-07-17      2021-07-05  9045        2021-07-04   11245
#>  ---                                                                        
#>  99:   795797     2021-07-17      2021-07-08  9045        2021-07-03   12945
#> 100:   796617     2021-07-17      2021-07-08  9045        2021-07-04   12945
#> 101:   797848     2021-07-17      2021-07-08  9045        2021-07-04   12945
#> 102:   798374     2021-07-17      2021-07-08  9045        2021-07-04   12945
#> 103:   801121     2021-07-17      2021-07-08  9045        2021-07-05   12945
#>      diff
#>   1: 2200
#>   2: 2200
#>   3: 3721
#>   4: 2200
#>   5: 2200
#>  ---     
#>  99: 3900
#> 100: 3900
#> 101: 3900
#> 102: 3900
#> 103: 3900

Created on 2021-08-29 by the reprex package (v0.3.0)
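
Since the question asks for a function, the {data.table} steps above could be wrapped roughly as follows. This is a minimal sketch, not a definitive implementation: the function name find_tf_diff, its threshold argument and the column name created_at_date_OD are hypothetical, the toy tables are made up to resemble the question's data, and diff keeps the BD-minus-OD direction used above.

```r
library(data.table)

# Hypothetical helper: name and arguments are illustrative only
find_tf_diff <- function(od, bd, threshold = 2000) {
  # Work on copies so that := does not modify the caller's tables by reference
  od <- copy(as.data.table(od))
  bd <- copy(as.data.table(bd))

  # Keep the OD creation date, which the non-equi join would otherwise overwrite
  od[, created_at_date_OD := created_at_date]

  res <- od[bd,
            on = .(created_at_date < created_at_date,
                   Departure_date = Departure_date)]
  res[, diff := i.TF_PP - TF_PP]   # BD TF_PP (prefix i.) minus OD TF_PP
  res[diff > threshold]            # rows with NA diff (no match) are dropped
}

# Toy data modelled on the question (illustrative values, not the full dput)
OD <- data.table(
  Order_ID        = c(792251, 794073),
  Departure_date  = as.Date(c("2021-07-17", "2021-07-17")),
  created_at_date = as.Date(c("2021-07-02", "2021-07-03")),
  TF_PP           = c(9045, 7524)
)
BD <- data.table(
  Departure_date  = as.Date(rep("2021-07-17", 3)),
  created_at_date = as.Date(c("2021-07-02", "2021-07-05", "2021-07-08")),
  TF_PP           = c(9045, 11245, 12945)
)

res <- find_tf_diff(OD, BD, threshold = 2000)
res
```

In the result, created_at_date holds the BD date and created_at_date_OD the OD date; OD itself is left unchanged because the function works on a copy.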