Widen data by group with multiple conditions

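I have data about Jenkins job pipeline executions, and I'm trying to determine the average time it takes to get from development to production, based on the start and end times in the data. The data is a bit like a transactional database: the execution of a pipeline to dev is one unique record, and the execution of the same pipeline to production is another unique record (the two share only one grouping variable, the team that ran the job).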

Here is a sample of the data I started with:

  job_id   startTime            endTime               env_type  Team_ID
1  100      8/4/2017 17:14:00   8/4/2017 17:16:00      DEV       A
2  101      8/4/2017 17:20:00   8/4/2017 17:21:00      DEV       A
3  102      8/4/2017 17:24:00   8/4/2017 17:27:00      DEV       B
4  103      8/4/2017 17:38:00   8/4/2017 17:40:00      DEV       B
5  104      8/4/2017 17:40:00   8/4/2017 17:42:00      DEV       C
6  105      8/4/2017 17:51:00   8/4/2017 17:54:00      DEV       C

In my first attempt at widening the data, I used mutate to create new columns and copy the start and end times into them based on env_type:

df %>%
    mutate(prod_job_id = ifelse(env_type == "PROD", job_id, ""), 
           prod_start_time = ifelse(env_type == "PROD", startTime, ""), 
           prod_end_time = ifelse(env_type == "PROD", endTime, ""),  
           dev_job_id = ifelse(env_type == "DEV", job_id, ""), 
           dev_start_time = ifelse(env_type == "DEV", startTime, ""), 
           dev_end_time = ifelse(env_type == "DEV", endTime, ""))

This got me something like the following (after also converting the times with as.POSIXct):

Team_ID env_type      dev_start_time        dev_end_time     prod_start_time       prod_end_time
1        A      DEV 2018-08-01 12:00:00 2018-08-01 13:00:00                <NA>                <NA>
2        A      DEV 2018-08-02 12:00:00 2018-08-02 13:00:00                <NA>                <NA>
3        A     PROD                <NA>                <NA> 2018-08-02 14:00:00 2018-08-02 15:00:00
4        A     PROD                <NA>                <NA> 2018-08-02 16:00:00 2018-08-02 17:00:00
5        B      DEV 2018-08-01 12:00:00 2018-08-01 13:00:00                <NA>                <NA>
6        B      DEV 2018-08-02 12:00:00 2018-08-02 13:00:00                <NA>                <NA>
7        B     PROD                <NA>                <NA> 2018-08-02 16:00:00 2018-08-02 17:00:00
8        C      DEV 2018-08-05 12:00:00 2018-08-05 13:00:00                <NA>                <NA>
9        C      DEV 2018-08-06 12:00:00 2018-08-06 13:00:00                <NA>                <NA>
10       C     TEST 2018-08-06 14:00:00 2018-08-06 15:00:00                <NA>                <NA>

Here is the dput() output:

structure(list(Team_ID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 
2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"
), class = "factor"), pipeline_id = c(1000L, 1000L, 1000L, 1000L, 
2000L, 2000L, 2000L, 3000L, 3000L, 3000L, 4000L, 4000L, 5000L, 
5000L), env_type = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 
1L, 3L, 1L, 1L, 2L, 2L), .Label = c("DEV", "PROD", "TEST"), class = "factor"), 
    dev_start_time = structure(c(1533142800, 1533229200, NA, 
    NA, 1533142800, 1533229200, NA, 1533488400, 1533574800, 1533582000, 
    1533142800, 1533229200, NA, NA), class = c("POSIXct", "POSIXt"
    ), tzone = ""), dev_end_time = structure(c(1533146400, 1533232800, 
    NA, NA, 1533146400, 1533232800, NA, 1533492000, 1533578400, 
    1533585600, 1533146400, 1533232800, NA, NA), class = c("POSIXct", 
    "POSIXt"), tzone = ""), prod_start_time = structure(c(NA, 
    NA, 1533236400, 1533243600, NA, NA, 1533243600, NA, NA, NA, 
    NA, NA, 1533236400, 1533243600), class = c("POSIXct", "POSIXt"
    ), tzone = ""), prod_end_time = structure(c(NA, NA, 1533240000, 
    1533247200, NA, NA, 1533247200, NA, NA, NA, NA, NA, 1533240000, 
    1533247200), class = c("POSIXct", "POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA, 
-14L))

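The tricky part is that a pipeline may go to dev several times before it goes to production, and may even go to production again afterwards without going back to dev, as you can see in the data frame above.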

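I'd like to figure out how to write a loop (or a dplyr/purrr chain, or some *ply function) that aligns the data so that I can use difftime to get the deployment durations. The end goal is to get the diffTime of every pipeline that goes from dev to prod and then take the mean of those values. To get there, I'm trying to reshape the data into something like this (after the manipulation, env_type will no longer be meaningful, but that's fine because in the end I'm only interested in diffTime):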

Team_ID env_type      dev_start_time        dev_end_time     prod_start_time       prod_end_time diffTime
1       A     PROD 2018-08-01 12:00:00 2018-08-01 13:00:00 2018-08-02 14:00:00 2018-08-02 15:00:00  2678400
2       B     PROD 2018-08-02 12:00:00 2018-08-02 13:00:00 2018-08-02 16:00:00 2018-08-02 17:00:00    18000
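
For reference, a minimal sketch of how a single diffTime value can be computed with base R's difftime. It assumes the interval runs from dev_start_time to prod_end_time and that the timestamps are in UTC; both are assumptions made for illustration, using the values from row 2 of the table above.

```r
# Assumption: timestamps are UTC and the interval is dev start -> prod end.
dev_start <- as.POSIXct("2018-08-02 12:00:00", tz = "UTC")
prod_end  <- as.POSIXct("2018-08-02 17:00:00", tz = "UTC")

# difftime() returns a "difftime" object; as.numeric() strips the units
# attribute so the result can be averaged like any other number.
diff_secs <- as.numeric(difftime(prod_end, dev_start, units = "secs"))
print(diff_secs)  # 18000 (5 hours)
```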

In plain English, I think what I need is:

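For every row where env_type == "PROD", find the closest Dev timestamp and overwrite the Dev columns with that value: something like max(dev_end_time where dev_end_time is not greater than prod_start_time and dev_end_time is greater than the previous value of prod_end_time). I know the data needs to be grouped by Team_ID and arranged in order. I also know that I have to start from the prod pipelines and work backwards.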

I started with this:

df %>% 
    group_by(Team_ID) %>% 
    arrange(Team_ID, startTime) 

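so that the data is grouped and arranged chronologically. But where do I go from here? My first thought was that mutate might work: mutate(dev_start_time = ifelse((dev_end_time < prod_start_time) & (dev_end_time > prod_start_time -1)), dev_start_time, "") but I don't know how to make R look at the right row (prod_start_time -1 is supposed to mean the previous prod row, not the time minus 1).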

I know there has to be a way to do this, but I'm just not familiar with the functions that would accomplish it.

EDIT:

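For @LetEpsilonBeLessThanZero: the point I'm trying to get across is that grouping by pipeline_id and then filtering for pipelines with at least 1 dev row and 1 prod row would throw away valuable data. To demonstrate, let's look at the data below: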

Team_ID pipeline_id env_type      dev_start_time        dev_end_time     prod_start_time       prod_end_time
1        A        1000      DEV 2018-08-01 12:00:00 2018-08-01 13:00:00                <NA>                <NA>
2        A        1000      DEV 2018-08-02 12:00:00 2018-08-02 13:00:00                <NA>                <NA>
3        A        1000     PROD                <NA>                <NA> 2018-08-02 14:00:00 2018-08-02 15:00:00
4        A        1000     PROD                <NA>                <NA> 2018-08-02 16:00:00 2018-08-02 17:00:00
5        B        2000      DEV 2018-08-01 12:00:00 2018-08-01 13:00:00                <NA>                <NA>
6        B        2000      DEV 2018-08-02 12:00:00 2018-08-02 13:00:00                <NA>                <NA>
7        B        2000     PROD                <NA>                <NA> 2018-08-02 16:00:00 2018-08-02 17:00:00
8        C        3000      DEV 2018-08-05 12:00:00 2018-08-05 13:00:00                <NA>                <NA>
9        C        3000      DEV 2018-08-06 12:00:00 2018-08-06 13:00:00                <NA>                <NA>
10       C        3000     TEST 2018-08-06 14:00:00 2018-08-06 15:00:00                <NA>                <NA>
11       D        4000      DEV 2018-08-01 12:00:00 2018-08-01 13:00:00                <NA>                <NA>
12       D        4000      DEV 2018-08-02 12:00:00 2018-08-02 13:00:00                <NA>                <NA>
13       D        5000     PROD                <NA>                <NA> 2018-08-02 14:00:00 2018-08-02 15:00:00
14       D        5000     PROD                <NA>                <NA> 2018-08-02 16:00:00 2018-08-02 17:00:00

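Notice how Team D created one distinct Dev pipeline and one distinct Prod pipeline. I still need a way to link them and measure the time difference, because I know the deployments are for the same application, but that can't be done by grouping on pipeline_id the way you suggested.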

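On a side note, I know we need a new way to group these teams so that the jobs are easier to correlate, and there are now plans to make that happen. But I still have to find a way to get as much as I can out of the data I currently have, so all help is appreciated.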

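How about the code below? I modified one of your dummy datasets so that I could test a few different scenarios.

The df data frame is the unaltered dummy dataset.

df_w_implied_proj_id shows how I determine "proj_id", a field I created. proj_id represents the "true" pipeline.

mean_dev_df computes the mean of the total diffTime across proj_ids.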

library(dplyr)

df = data.frame(startTime = as.POSIXct(c("2018-08-01 12:00:00",
                                         "2018-08-02 10:00:00",
                                         "2018-08-02 14:00:00",
                                         "2018-08-02 16:00:00",
                                         "2018-08-01 12:00:00",
                                         "2018-08-02 12:00:00",
                                         "2018-08-02 16:00:00",
                                         "2018-08-05 12:00:00",
                                         "2018-08-06 12:00:00",
                                         "2018-08-06 14:00:00",
                                         "2018-08-06 16:00:00",
                                         "2018-08-06 18:00:00",
                                         "2018-08-01 12:00:00",
                                         "2018-08-02 12:00:00",
                                         "2018-08-02 14:00:00",
                                         "2018-08-02 16:00:00"), format="%Y-%m-%d %H:%M:%S"),
                endTime = as.POSIXct(c("2018-08-01 13:00:00",
                                       "2018-08-02 13:00:00",
                                       "2018-08-02 15:00:00",
                                       "2018-08-02 18:00:00",
                                       "2018-08-01 13:00:00",
                                       "2018-08-02 13:00:00",
                                       "2018-08-02 18:00:00",
                                       "2018-08-05 13:00:00",
                                       "2018-08-06 13:00:00",
                                       "2018-08-06 15:00:00",
                                       "2018-08-06 17:00:00",
                                       "2018-08-06 19:00:00",
                                       "2018-08-01 13:00:00",
                                       "2018-08-02 13:00:00",
                                       "2018-08-02 15:00:00",
                                       "2018-08-02 21:00:00"), format="%Y-%m-%d %H:%M:%S"),
                env_type = c("DEV","DEV","PROD","PROD","DEV","DEV","PROD","DEV","DEV","PROD","DEV","PROD","DEV","DEV","PROD","PROD"),
                Team_ID = c("A","A","A","A","B","B","B","C","C","C","C","C","D","D","D","D"))

df_w_implied_proj_id = df %>%
  arrange(Team_ID, startTime) %>%
  mutate(diffTimeSecs = difftime(endTime,startTime,units="secs"),
         proj_id = cumsum(env_type != lag(env_type, default = first(env_type))) %/% 2 + 1) %>%
  group_by(proj_id) %>%
  mutate(total_proj_diffTimeSecs = sum(diffTimeSecs))

mean_dev_df = df_w_implied_proj_id %>%
  group_by(proj_id) %>%
  summarise(temp_totals = sum(diffTimeSecs)) %>%
  ungroup() %>%
  summarise(mean_total_proj_diffTimeSecs = mean(temp_totals))

The workhorse of this code is this line:

proj_id = cumsum(env_type != lag(env_type, default = first(env_type))) %/% 2 + 1

To understand it, let's look at the env_type values in the dataset:

env_type
DEV
DEV
PROD
PROD
DEV
DEV
PROD
DEV
DEV
PROD
DEV
PROD
DEV
DEV
PROD
PROD

The lag function simply returns the value from the previous row. So, as a random example, lag(c("A","B","C"),default="BALLOON") returns c("BALLOON","A","B")
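
A quick way to check this behaviour in a console (note that this is dplyr::lag, not stats::lag from base R, which works on time series instead):

```r
library(dplyr)

x <- c("A", "B", "C")

# Without a default, the first element has no predecessor and becomes NA.
shifted_na <- lag(x)

# With a default, that gap is filled with the supplied value instead.
shifted <- lag(x, default = "BALLOON")

print(shifted_na)  # NA  "A" "B"
print(shifted)     # "BALLOON" "A" "B"
```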

So env_type != lag(env_type, default = first(env_type)) returns this:

env_type != lag(env_type, default = first(env_type))
0 (note: there's no row before the first row, so the lag statement defaults this to the first element of env_type vector, which is "DEV". And "DEV" != "DEV" evaluates to FALSE aka 0)
0 (note: "DEV" != "DEV" evaluates to FALSE aka 0)
1 (note: "PROD" != "DEV" evaluates to TRUE aka 1)
0 (note: "PROD" != "PROD" evaluates to FALSE aka 0. By now you hopefully get the gist of what's going on.)
1
0
1
1
0
1
1
1
1
0
1
0

Then cumsum(...) of that vector of 0s and 1s gives:

0 0 1 1 2 2 3 4 4 5 6 7 8 8 9 9

Each increase of 1 marks a switch from "DEV" to "PROD" or vice versa.

Then we can squash each even number together with its odd successor by integer-dividing every number by 2 and adding 1, which gives:

1 1 1 1 2 2 2 3 3 3 4 4 5 5 5 5

These are our final proj_ids.
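
If you want to reproduce the three steps one at a time, here is a small sketch that runs them on the bare env_type vector listed above. The intermediate names (changed, switches) are just for illustration and don't appear in the code above.

```r
library(dplyr)

# env_type in the order produced by arrange(Team_ID, startTime)
env_type <- c("DEV", "DEV", "PROD", "PROD", "DEV", "DEV", "PROD", "DEV",
              "DEV", "PROD", "DEV", "PROD", "DEV", "DEV", "PROD", "PROD")

# Step 1: TRUE wherever the value differs from the previous row
changed <- env_type != lag(env_type, default = first(env_type))

# Step 2: running count of DEV <-> PROD switches
switches <- cumsum(changed)

# Step 3: pair each even count with its odd successor
proj_id <- switches %/% 2 + 1

print(as.integer(changed))  # 0 0 1 0 1 0 1 1 0 1 1 1 1 0 1 0
print(switches)             # 0 0 1 1 2 2 3 4 4 5 6 7 8 8 9 9
print(proj_id)              # 1 1 1 1 2 2 2 3 3 3 4 4 5 5 5 5
```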

Credit for the answer really goes to LetEpsilonBeLessThanZero, since he pointed me towards dplyr::lag(). But I have tested the following solution, and it does exactly what I need.

df %>% 
    group_by(Team_ID) %>% 
    arrange(Team_ID, startTime) %>% 
    mutate("Dev-Prod" = as.numeric(difftime(prod_end_time, lag(dev_start_time), units = "secs"))) %>%
    filter(!is.na(`Dev-Prod`))
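
For anyone who wants to try this, here is a self-contained sketch that runs the same chain on a cut-down version of the sample data (teams A and B only). The data frame name wide, the ts() helper, the UTC time zone, and the assumption that the widened frame still carries the original startTime column are all mine, for illustration only.

```r
library(dplyr)

# Hypothetical helper: parse timestamps as UTC so results don't depend on locale.
ts <- function(x) as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Assumed shape: the widened data, still carrying startTime for ordering.
wide <- data.frame(
  Team_ID  = c("A", "A", "A", "A", "B", "B", "B"),
  env_type = c("DEV", "DEV", "PROD", "PROD", "DEV", "DEV", "PROD"),
  startTime = ts(c("2018-08-01 12:00:00", "2018-08-02 12:00:00",
                   "2018-08-02 14:00:00", "2018-08-02 16:00:00",
                   "2018-08-01 12:00:00", "2018-08-02 12:00:00",
                   "2018-08-02 16:00:00")),
  dev_start_time = ts(c("2018-08-01 12:00:00", "2018-08-02 12:00:00", NA, NA,
                        "2018-08-01 12:00:00", "2018-08-02 12:00:00", NA)),
  prod_end_time = ts(c(NA, NA, "2018-08-02 15:00:00", "2018-08-02 17:00:00",
                       NA, NA, "2018-08-02 17:00:00"))
)

result <- wide %>%
  group_by(Team_ID) %>%
  arrange(Team_ID, startTime) %>%
  # lag() is applied per team, so each PROD row sees the row just before it
  mutate("Dev-Prod" = as.numeric(difftime(prod_end_time, lag(dev_start_time), units = "secs"))) %>%
  filter(!is.na(`Dev-Prod`)) %>%
  ungroup()

mean_secs <- mean(result$`Dev-Prod`)
print(result$`Dev-Prod`)  # 10800 18000
print(mean_secs)          # 14400
```

Note that only a PROD row directly preceded by a DEV row survives the filter. Team A's second PROD run (16:00 to 17:00) is dropped because the row before it is another PROD row with no dev_start_time.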