使用 R 或 Python 中的 If Else 语句创建组、中断和条件
Create groups, breaks and conditions using If Else statements in R or Python
我有一个大型数据集(200 万条记录)df,我正在尝试在日期时间内对其进行分组和创建中断。如果满足以下条件,我想定义一个组并创建这些"breaks":(这是一个大数据集,我不知道主题、收件人和长度列的内容)
If the edit == "T"
If the message is ""
If the folder is "out" or "draft"
如果长度列的最后一个值,我想匹配这些组
匹配下一组的值,长度列的第一行。因此,例如,值“80”连接组以及编辑为 T,文件夹已出或草稿,消息为空白。
subject recipient length folder message date edit
80 out 1/2/2020 1:00:01 AM T
80 out 1/2/2020 1:00:05 AM T
hey sarah@mail.com,g@mail.com 80 out 1/2/2020 1:00:10 AM T
hey sarah@mail.com,g@mail.com 80 out 1/2/2020 1:00:15 AM T
hey sarah@mail.com,g@mail.com 80 out 1/2/2020 1:00:30 AM T
some k 900 in jjjjj 1/2/2020 1:00:35 AM F
some k 900 in jjjjj 1/2/2020 1:00:36 AM F
some k 900 in jjjjj 1/2/2020 1:00:37 AM F
hey sarah@mail.com,g@mail.com 80 draft 1/2/2020 1:02:00 AM T
hey sarah@mail.com,g@mail.com 80 draft 1/2/2020 1:02:05 AM T
no a 900 in iii 1/2/2020 1:02:10 AM F
no a 900 in iii 1/2/2020 1:02:15 AM F
no a 900 in iii 1/2/2020 1:02:20 AM F
no a 900 in iii 1/2/2020 1:02:25 AM F
hey sarah@mail.com,g@mail.com 80 draft 1/2/2020 1:03:00 AM T
hey sarah@mail.com,g@mail.com 80 draft 1/2/2020 1:03:20 AM T
那我想link把这些群放在一起
如果一个块的最后一行的长度,
匹配下一个块的第一行的长度列。
我已经开始修改下面的代码,但不确定如何执行它。
这是期望的输出:
Start End Duration Group Subject Length
1/2/2020 1:00:01 AM 1/2/2020 1:00:30 AM 29 A hey 80
1/2/2020 1:02:00 AM 1/2/2020 1:02:05 AM 5 A hey 80
1/2/2020 1:03:00 AM 1/2/2020 1:03:20 AM 20 A hey 80
所有这些都在同一组 A 中,因为长度列的最后一行与下一组长度列的第一行匹配。
library(tidyverse)
library(lubridate)
df$Date <- lubridate::dmy_hms(df$Date)
df <- mutate_if(df, is.factor, as.character)
df$GROUP <- "Edit == "T", Folder == "out"|"draft", Message == """
df$BREAK_DETECTOR <- ""
group_count <- 0
break_count <- 0
for (i in 1:nrow(df)) {
if (i == 1) {
group_count <- group_count + 1
df$GROUP[[i]] <- letters[[group_count]]
}
if (i > 1) {
if (df$GROUP[[i - 1]] != "") {
df$GROUP[[i]] <- df$GROUP[[i - 1]]
} else {
group_count <- group_count + 1
df$GROUP[[i]] <- letters[[group_count]]
}
}
if (i == 1) {
break_count <- break_count + 1
df$BREAK_DETECTOR[[i]] <- break_count
} else { #rules for detecting breaks - I chose to make it depend on NA values in the Length field
if (is.na(df$Length[[i]])) {
if (!is.na(df$Length[[i - 1]])) { # and only if the previous line isnt also NA for Length
break_count <- break_count + 1
}
}
df$BREAK_DETECTOR[[i]] <- break_count
}
}
df2 <- df %>%
filter(!is.na(Length)) %>%
group_by(
GROUP, BREAK_DETECTOR
) %>%
summarise(
start = min(Date),
end = max(Date),
duration = difftime(end, start, units = "secs"),
min_subject = min(Subject),
max_subject = max(Subject),
min_recipient = min(Recipient),
max_recipient = max(Recipient),
min_length = min(Length),
max_length = max(Length)
) %>%
ungroup()
这里是输入:
structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 4L,
4L, 4L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 1L, 1L), .Label = c("",
"hey", "no", "some"), class = "factor"), Recipient = structure(c(1L,
1L, 5L, 5L, 5L, 4L, 4L, 4L, 5L, 5L, 3L, 3L, 3L, 3L, 5L, 5L, 1L,
2L), .Label = c("", " ", "a", "k", "sarah@mail.com,gee@mail.com"
), class = "factor"), Length = c(80L, 80L, 80L, 80L, 80L, 900L,
900L, 900L, 80L, 80L, 900L, 900L, 900L, 900L, 80L, 80L, NA, NA
), Folder = structure(c(4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 2L, 2L,
3L, 3L, 3L, 3L, 2L, 2L, 1L, 1L), .Label = c("", "draft", "in",
"out"), class = "factor"), Message = structure(c(1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L), .Label = c("",
"jjjjjjj", "llll"), class = "factor"), Date = structure(c(2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L,
17L, 1L, 1L), .Label = c("", "1/2/2020 1:00:01 AM", "1/2/2020 1:00:05 AM",
"1/2/2020 1:00:10 AM", "1/2/2020 1:00:15 AM", "1/2/2020 1:00:30 AM",
"1/2/2020 1:00:35 AM", "1/2/2020 1:00:36 AM", "1/2/2020 1:00:37 AM",
"1/2/2020 1:02:00 AM", "1/2/2020 1:02:05 AM", "1/2/2020 1:02:10 AM",
"1/2/2020 1:02:15 AM", "1/2/2020 1:02:20 AM", "1/2/2020 1:02:25 AM",
"1/2/2020 1:03:00 AM", "1/2/2020 1:03:20 AM"), class = "factor"),
Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, NA, NA
)), class = "data.frame", row.names = c(NA, -18L))
使用dplyr
:
library(dplyr)
df %>%
#Add row number
mutate(row = row_number(),
#Convert to Posixct
Date = lubridate::mdy_hms(Date)) %>%
#Keep only TRUE rows
filter(Edit) %>%
#Create groups
group_by(gr = cumsum(c(TRUE, diff(row) > 1))) %>%
#Get first, last and difference between the dates
summarise(Start = first(Date),
End = last(Date),
Duration = difftime(End, Start, "secs"),
Group = "A", Subject = "hey", Length = 80) %>%
select(-gr)
# A tibble: 3 x 6
# Start End Duration Group Subject Length
# <dttm> <dttm> <drtn> <chr> <chr> <dbl>
#1 2020-01-02 01:00:01 2020-01-02 01:00:30 29 secs A hey 80
#2 2020-01-02 01:02:00 2020-01-02 01:02:05 5 secs A hey 80
#3 2020-01-02 01:03:00 2020-01-02 01:03:20 20 secs A hey 80
我有一个大型数据集(200 万条记录)df,我正在尝试在日期时间内对其进行分组和创建中断。如果满足以下条件,我想定义一个组并创建这些"breaks":(这是一个大数据集,我不知道主题、收件人和长度列的内容)
If the edit == "T"
If the message is ""
If the folder is "out" or "draft"
如果长度列的最后一个值,我想匹配这些组 匹配下一组的值,长度列的第一行。因此,例如,值“80”连接组以及编辑为 T,文件夹已出或草稿,消息为空白。
subject recipient length folder message date edit
80 out 1/2/2020 1:00:01 AM T
80 out 1/2/2020 1:00:05 AM T
hey sarah@mail.com,g@mail.com 80 out 1/2/2020 1:00:10 AM T
hey sarah@mail.com,g@mail.com 80 out 1/2/2020 1:00:15 AM T
hey sarah@mail.com,g@mail.com 80 out 1/2/2020 1:00:30 AM T
some k 900 in jjjjj 1/2/2020 1:00:35 AM F
some k 900 in jjjjj 1/2/2020 1:00:36 AM F
some k 900 in jjjjj 1/2/2020 1:00:37 AM F
hey sarah@mail.com,g@mail.com 80 draft 1/2/2020 1:02:00 AM T
hey sarah@mail.com,g@mail.com 80 draft 1/2/2020 1:02:05 AM T
no a 900 in iii 1/2/2020 1:02:10 AM F
no a 900 in iii 1/2/2020 1:02:15 AM F
no a 900 in iii 1/2/2020 1:02:20 AM F
no a 900 in iii 1/2/2020 1:02:25 AM F
hey sarah@mail.com,g@mail.com 80 draft 1/2/2020 1:03:00 AM T
hey sarah@mail.com,g@mail.com 80 draft 1/2/2020 1:03:20 AM T
那我想link把这些群放在一起 如果一个块的最后一行的长度, 匹配下一个块的第一行的长度列。 我已经开始修改下面的代码,但不确定如何执行它。
这是期望的输出:
Start End Duration Group Subject Length
1/2/2020 1:00:01 AM 1/2/2020 1:00:30 AM 29 A hey 80
1/2/2020 1:02:00 AM 1/2/2020 1:02:05 AM 5 A hey 80
1/2/2020 1:03:00 AM 1/2/2020 1:03:20 AM 20 A hey 80
所有这些都在同一组 A 中,因为长度列的最后一行与下一组长度列的第一行匹配。
library(tidyverse)
library(lubridate)
df$Date <- lubridate::dmy_hms(df$Date)
df <- mutate_if(df, is.factor, as.character)
df$GROUP <- "Edit == "T", Folder == "out"|"draft", Message == """
df$BREAK_DETECTOR <- ""
group_count <- 0
break_count <- 0
for (i in 1:nrow(df)) {
if (i == 1) {
group_count <- group_count + 1
df$GROUP[[i]] <- letters[[group_count]]
}
if (i > 1) {
if (df$GROUP[[i - 1]] != "") {
df$GROUP[[i]] <- df$GROUP[[i - 1]]
} else {
group_count <- group_count + 1
df$GROUP[[i]] <- letters[[group_count]]
}
}
if (i == 1) {
break_count <- break_count + 1
df$BREAK_DETECTOR[[i]] <- break_count
} else { #rules for detecting breaks - I chose to make it depend on NA values in the Length field
if (is.na(df$Length[[i]])) {
if (!is.na(df$Length[[i - 1]])) { # and only if the previous line isnt also NA for Length
break_count <- break_count + 1
}
}
df$BREAK_DETECTOR[[i]] <- break_count
}
}
df2 <- df %>%
filter(!is.na(Length)) %>%
group_by(
GROUP, BREAK_DETECTOR
) %>%
summarise(
start = min(Date),
end = max(Date),
duration = difftime(end, start, units = "secs"),
min_subject = min(Subject),
max_subject = max(Subject),
min_recipient = min(Recipient),
max_recipient = max(Recipient),
min_length = min(Length),
max_length = max(Length)
) %>%
ungroup()
这里是输入:
structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 4L,
4L, 4L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 1L, 1L), .Label = c("",
"hey", "no", "some"), class = "factor"), Recipient = structure(c(1L,
1L, 5L, 5L, 5L, 4L, 4L, 4L, 5L, 5L, 3L, 3L, 3L, 3L, 5L, 5L, 1L,
2L), .Label = c("", " ", "a", "k", "sarah@mail.com,gee@mail.com"
), class = "factor"), Length = c(80L, 80L, 80L, 80L, 80L, 900L,
900L, 900L, 80L, 80L, 900L, 900L, 900L, 900L, 80L, 80L, NA, NA
), Folder = structure(c(4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 2L, 2L,
3L, 3L, 3L, 3L, 2L, 2L, 1L, 1L), .Label = c("", "draft", "in",
"out"), class = "factor"), Message = structure(c(1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L), .Label = c("",
"jjjjjjj", "llll"), class = "factor"), Date = structure(c(2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L,
17L, 1L, 1L), .Label = c("", "1/2/2020 1:00:01 AM", "1/2/2020 1:00:05 AM",
"1/2/2020 1:00:10 AM", "1/2/2020 1:00:15 AM", "1/2/2020 1:00:30 AM",
"1/2/2020 1:00:35 AM", "1/2/2020 1:00:36 AM", "1/2/2020 1:00:37 AM",
"1/2/2020 1:02:00 AM", "1/2/2020 1:02:05 AM", "1/2/2020 1:02:10 AM",
"1/2/2020 1:02:15 AM", "1/2/2020 1:02:20 AM", "1/2/2020 1:02:25 AM",
"1/2/2020 1:03:00 AM", "1/2/2020 1:03:20 AM"), class = "factor"),
Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, NA, NA
)), class = "data.frame", row.names = c(NA, -18L))
使用dplyr
:
library(dplyr)
df %>%
#Add row number
mutate(row = row_number(),
#Convert to Posixct
Date = lubridate::mdy_hms(Date)) %>%
#Keep only TRUE rows
filter(Edit) %>%
#Create groups
group_by(gr = cumsum(c(TRUE, diff(row) > 1))) %>%
#Get first, last and difference between the dates
summarise(Start = first(Date),
End = last(Date),
Duration = difftime(End, Start, "secs"),
Group = "A", Subject = "hey", Length = 80) %>%
select(-gr)
# A tibble: 3 x 6
# Start End Duration Group Subject Length
# <dttm> <dttm> <drtn> <chr> <chr> <dbl>
#1 2020-01-02 01:00:01 2020-01-02 01:00:30 29 secs A hey 80
#2 2020-01-02 01:02:00 2020-01-02 01:02:05 5 secs A hey 80
#3 2020-01-02 01:03:00 2020-01-02 01:03:20 20 secs A hey 80