Writing an R function that finds the differences between two data frames
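My problem is basically this: I have two datasets, Booking Data (BD) and Order Data (OD). BD contains orders that were booked but not yet paid for/confirmed, while OD only has records of confirmed orders (which is why they have an Order_ID). So BD has more records than OD.
The two have roughly the same structure, except that the OD table has an additional column called Order_ID. I created a new table from both data frames, as shown below: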
OD
Order_ID Departure_date created_at_date TF_PP
<dbl> <dttm> <dttm> <dbl>
1 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045
2 792563 2021-07-17 00:00:00 2021-07-02 00:00:00 9045
3 794073 2021-07-17 00:00:00 2021-07-03 00:00:00 7524
4 795797 2021-07-17 00:00:00 2021-07-03 00:00:00 9045
5 796617 2021-07-17 00:00:00 2021-07-04 00:00:00 9045
6 797848 2021-07-17 00:00:00 2021-07-04 00:00:00 9045
BD
Departure_date created_at_date TF_PP
1: 2021-07-17 2021-07-02 9045
2: 2021-07-17 2021-07-02 9045
3: 2021-07-17 2021-07-02 9045
4: 2021-07-17 2021-07-03 9045
5: 2021-07-17 2021-07-03 7524
6: 2021-07-17 2021-07-03 9045
7: 2021-07-17 2021-07-03 9045
8: 2021-07-17 2021-07-04 5142
9: 2021-07-17 2021-07-04 9045
10: 2021-07-17 2021-07-04 10000
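Problem
I want to write a function that, for EACH Order_ID in OD, takes the created_at_date and the corresponding TF_PP value and finds the rows in BD whose TF_PP is lower than OD's TF_PP by more than 2000 AND whose created_at_date is after OD's created_at_date AND that share the same Departure_date as the Order_ID in OD. The function should then return a new data frame containing the difference between the two (difference > 2000) along with the Order_ID, Departure_date and created_at_date columns. If the diff is < 2000, the function should not return that row in the new data frame.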
Output:
Order_ID Dep_Date OD(Created_at_Date) TF BD(Created_at_Date) TF Diff
766787 2021-07-17 2021-07-02 9040 2021-07-04 6950 2090
766787 2021-07-17 2021-07-02 9040 2021-07-12 6895 2145
839265 2021-08-20 2021-08-08 12987 2021-08-15 10000 2987
Please let me know if my question is confusing or needs further clarification.
Edit
dput output
OD:
structure(list(Order_ID = c(792251, 792563, 794073, 795797, 796617,
797848, 798374, 798990, 801121, 801643, 808494, 809900, 810710,
814812, 815040, 815257, 817469, 819219), Departure_date = structure(c(1626480000,
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000,
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000,
1626480000, 1626480000, 1626480000, 1626480000, 1626480000), tzone = "UTC", class = c("POSIXct",
"POSIXt")), created_at_date = structure(c(1625184000, 1625184000,
1625270400, 1625270400, 1625356800, 1625356800, 1625356800, 1625443200,
1625443200, 1625529600, 1625702400, 1625788800, 1625788800, 1625961600,
1625961600, 1625961600, 1626048000, 1626048000), tzone = "UTC", class = c("POSIXct",
"POSIXt")), TF_PP = c(9045, 9045, 7524, 9045, 9045, 9045, 9045,
11245, 9045, 11245, 12945, 12945, 12945, 12945, 12945, 12945,
14945, 14945)), row.names = c(NA, -18L), groups = structure(list(
Order_ID = c(792251, 792563, 794073, 795797, 796617, 797848,
798374, 798990, 801121, 801643, 808494, 809900, 810710, 814812,
815040, 815257, 817469, 819219), .rows = structure(list(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L,
15L, 16L, 17L, 18L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -18L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
BD:
structure(list(Departure_date = structure(c(1626480000, 1626480000,
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000,
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000,
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000,
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000,
1626480000, 1626480000, 1626480000, 1626480000), tzone = "UTC", class = c("POSIXct",
"POSIXt")), created_at_date = structure(c(1625184000, 1625184000,
1625184000, 1625270400, 1625270400, 1625270400, 1625270400, 1625356800,
1625356800, 1625356800, 1625356800, 1625356800, 1625356800, 1625356800,
1625356800, 1625443200, 1625443200, 1625443200, 1625529600, 1625529600,
1625529600, 1625529600, 1625529600, 1625529600, 1625702400, 1625702400,
1625702400, 1625702400, 1625702400, 1625702400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), TF_PP = c(9045, 9045, 9045, 9045, 7524, 9045, 9045,
9045, 9045, 9045, 9045, 9045, 9045, 9045, 9045, 11245, 9045,
9045, 11245, 11245, 11245, 11245, 11245, 11245, 11245, 11245,
11245, 12945, 12945, 12945)), row.names = c(NA, -30L), class = c("data.table",
"data.frame"))
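Maybe something like this, using plain SQL:
library(sqldf)
# Pair OD and BD rows that share a departure date, keeping pairs where
# OD was created after BD and OD's TF_PP exceeds BD's by more than 2000
sqldf("
  SELECT *, OD.TF_PP - BD.TF_PP AS diff
  FROM OD, BD
  WHERE (OD.created_at_date > BD.created_at_date)
    AND (OD.TF_PP - BD.TF_PP > 2000)
    AND (BD.Departure_date = OD.Departure_date)")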
Although this question is tagged {dplyr}, this kind of non-equi join is easier to do with {data.table}:
library(data.table)
OD <- as.data.table(OD)
BD <- as.data.table(BD)
OD[BD,
on = .(created_at_date < created_at_date, Departure_date = Departure_date)
][
,`:=`("diff" = i.TF_PP - TF_PP)
][
diff > 2000]
#> Order_ID Departure_date created_at_date TF_PP i.TF_PP diff
#> 1: 792251 2021-07-17 2021-07-05 9045 11245 2200
#> 2: 792563 2021-07-17 2021-07-05 9045 11245 2200
#> 3: 794073 2021-07-17 2021-07-05 7524 11245 3721
#> 4: 795797 2021-07-17 2021-07-05 9045 11245 2200
#> 5: 796617 2021-07-17 2021-07-05 9045 11245 2200
#> ---
#> 99: 795797 2021-07-17 2021-07-08 9045 12945 3900
#> 100: 796617 2021-07-17 2021-07-08 9045 12945 3900
#> 101: 797848 2021-07-17 2021-07-08 9045 12945 3900
#> 102: 798374 2021-07-17 2021-07-08 9045 12945 3900
#> 103: 801121 2021-07-17 2021-07-08 9045 12945 3900
Created on 2021-08-27 by the reprex package (v2.0.1)
For {dplyr}, we would use full_join, then create diff, and then filter:
library(dplyr)
OD %>%
full_join(BD, by = "Departure_date") %>%
mutate(diff = TF_PP.y - TF_PP.x) %>%
filter(created_at_date.x < created_at_date.y,
diff > 2000)
#> # A tibble: 103 x 7
#> # Groups: Order_ID [8]
#> Order_ID Departure_date created_at_date.x TF_PP.x created_at_date.y
#> <dbl> <dttm> <dttm> <dbl> <dttm>
#> 1 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-05 00:00:00
#> 2 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-06 00:00:00
#> 3 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-06 00:00:00
#> 4 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-06 00:00:00
#> 5 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-06 00:00:00
#> 6 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-06 00:00:00
#> 7 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-06 00:00:00
#> 8 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-08 00:00:00
#> 9 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-08 00:00:00
#> 10 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-08 00:00:00
#> # ... with 93 more rows, and 2 more variables: TF_PP.y <dbl>, diff <dbl>
Created on 2021-08-27 by the reprex package (v2.0.1)
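The full_join approach first builds every OD/BD pair per departure date and only then filters on the dates. As a possible alternative, assuming dplyr 1.1.0 or later (newer than this answer), join_by() can put the inequality into the join itself. The sketch below uses two small made-up tables, od_demo and bd_demo (hypothetical names and values, not the data from the question), so that it runs on its own:

```r
library(dplyr)

# Hypothetical toy tables, for illustration only
od_demo <- tibble(
  Order_ID        = c(1, 2),
  Departure_date  = as.Date(c("2021-07-17", "2021-07-17")),
  created_at_date = as.Date(c("2021-07-02", "2021-07-05")),
  TF_PP           = c(9045, 9045)
)
bd_demo <- tibble(
  Departure_date  = as.Date(rep("2021-07-17", 3)),
  created_at_date = as.Date(c("2021-07-03", "2021-07-06", "2021-07-08")),
  TF_PP           = c(9045, 11245, 12945)
)

res <- od_demo %>%
  # Same departure date, and the OD row was created before the BD row
  inner_join(
    bd_demo,
    by = join_by(Departure_date, x$created_at_date < y$created_at_date),
    suffix = c(".x", ".y")
  ) %>%
  # BD price minus OD price, as in the full_join version
  mutate(diff = TF_PP.y - TF_PP.x) %>%
  filter(diff > 2000)

res
```

The column names (created_at_date.x, TF_PP.y and so on) follow the same .x/.y suffix convention as the full_join output, so the mutate and filter steps stay the same.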
As pointed out in the comments, the {data.table} approach only keeps the created_at_date of BD. To keep the created_at_date of the OD table as well, we first need to assign it to a different name:
OD[, "create_at_date_OD" := created_at_date
][BD,
on = .(created_at_date < created_at_date, Departure_date = Departure_date)
][
,`:=`("diff" = i.TF_PP - TF_PP)
][
diff > 2000]
#> Order_ID Departure_date created_at_date TF_PP create_at_date_OD i.TF_PP
#> 1: 792251 2021-07-17 2021-07-05 9045 2021-07-02 11245
#> 2: 792563 2021-07-17 2021-07-05 9045 2021-07-02 11245
#> 3: 794073 2021-07-17 2021-07-05 7524 2021-07-03 11245
#> 4: 795797 2021-07-17 2021-07-05 9045 2021-07-03 11245
#> 5: 796617 2021-07-17 2021-07-05 9045 2021-07-04 11245
#> ---
#> 99: 795797 2021-07-17 2021-07-08 9045 2021-07-03 12945
#> 100: 796617 2021-07-17 2021-07-08 9045 2021-07-04 12945
#> 101: 797848 2021-07-17 2021-07-08 9045 2021-07-04 12945
#> 102: 798374 2021-07-17 2021-07-08 9045 2021-07-04 12945
#> 103: 801121 2021-07-17 2021-07-08 9045 2021-07-05 12945
#> diff
#> 1: 2200
#> 2: 2200
#> 3: 3721
#> 4: 2200
#> 5: 2200
#> ---
#> 99: 3900
#> 100: 3900
#> 101: 3900
#> 102: 3900
#> 103: 3900
Created on 2021-08-29 by the reprex package (v0.3.0)
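Since the question asks for a function, the join-then-filter logic from the answers could be wrapped up roughly as follows. This is only a sketch in base R: the function name find_price_diffs, its arguments, the july() helper and the _OD/_BD column suffixes are made up here. It follows the sign convention of the {data.table} and {dplyr} answers (BD price minus OD price, with BD created after OD); swapping the subtraction would give the "BD lower than OD" reading from the question. OD and BD are rebuilt compactly from the dput output in the question so that the block runs on its own.

```r
# Hypothetical helper: a day in July 2021 as POSIXct (UTC)
july <- function(d) as.POSIXct(sprintf("2021-07-%02d", d), tz = "UTC")

# OD and BD rebuilt from the dput output in the question
OD <- data.frame(
  Order_ID = c(792251, 792563, 794073, 795797, 796617, 797848,
               798374, 798990, 801121, 801643, 808494, 809900,
               810710, 814812, 815040, 815257, 817469, 819219),
  Departure_date  = july(17),
  created_at_date = july(rep(c(2, 3, 4, 5, 6, 8, 9, 11, 12),
                             times = c(2, 2, 3, 2, 1, 1, 2, 3, 2))),
  TF_PP = c(9045, 9045, 7524, 9045, 9045, 9045, 9045, 11245, 9045,
            11245, rep(12945, 6), rep(14945, 2))
)
BD <- data.frame(
  Departure_date  = july(17),
  created_at_date = july(rep(c(2, 3, 4, 5, 6, 8),
                             times = c(3, 4, 8, 3, 6, 6))),
  TF_PP = c(9045, 9045, 9045, 9045, 7524, rep(9045, 10), 11245,
            9045, 9045, rep(11245, 9), rep(12945, 3))
)

find_price_diffs <- function(od, bd, threshold = 2000) {
  # Pair every OD row with every BD row that has the same departure date
  pairs <- merge(od, bd, by = "Departure_date", suffixes = c("_OD", "_BD"))
  # BD price minus OD price, as in the answers above
  pairs$diff <- pairs$TF_PP_BD - pairs$TF_PP_OD
  # Keep pairs where BD was created after OD and the gap exceeds the threshold
  keep <- pairs$created_at_date_OD < pairs$created_at_date_BD &
    pairs$diff > threshold
  out <- pairs[keep, c("Order_ID", "Departure_date",
                       "created_at_date_OD", "TF_PP_OD",
                       "created_at_date_BD", "TF_PP_BD", "diff")]
  rownames(out) <- NULL
  out
}

res <- find_price_diffs(OD, BD)
nrow(res)  # 103, the same row count as in the outputs above
```

Because merge() builds all pairs per departure date before filtering, this behaves like the full_join version and may become slow on large tables; the non-equi join in {data.table} avoids that intermediate step.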