Vectorizing the calculation of the difference from the lowest date value in R
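I'm working with a data frame that has 4 features: County, State, #COVID cases, and date. I want to add a column that calculates the number of days since the lowest date value for that county. I found a way to do it, but it requires a for loop and, given that there are over 60k rows, it takes far too long to execute. I'm stumped as to whether and how this can be calculated in a vectorized way so that it runs in a reasonable time.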
daysSinceFirstCase <- function (x) {
# create vector the length of the row count
vals <- 1:nrow(x)
# for each row
for(i in 1:nrow(x)) {
row <- x[i, ]
# get occurrences of that county and state
countyCases <- x[x$county == row$county & x$state == row$state,]
# get first date
firstDate <- countyCases[order(countyCases$date),]$date[1]
#calculate difference
diff <- as.integer(row$date - firstDate)
#store difference
vals[i] <- diff
print(i)
}
return(vals)
}
df['days_since_first_case'] <- daysSinceFirstCase(df)
Edit: Here is a sample of my data and the column I want to create.
Date | County | State | Cases | Days since first case
2020-03-14 | Philadelphia | PA | 500 | 0
2020-03-15 | Philadelphia | PA | 892 | 1
2020-03-16 | Philadelphia | PA | 1502 | 2
2020-03-22 | Baltimore | MD | 12 | 0
2020-03-23 | Baltimore | MD | 152 | 1
2020-03-24 | Baltimore | MD | 348 | 2
To answer your question: your code is not vectorized.
# get first date
firstDate <- countyCases[order(countyCases$date),]$date[1]
I think you are running this line 60k times. One potential improvement is to run it once per county + state group rather than once per row.
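The idea of computing the first date once per group can be sketched in base R roughly as follows. This assumes the column names from the sample data (`Date`, `County`, `State`) and that `Date` is of class `Date`; the helper name `days_since_first` and the `key` variable are illustrative and not from the question.

```r
# Sample data mirroring the table in the question
df <- data.frame(
  Date   = as.Date(c("2020-03-14", "2020-03-15", "2020-03-16",
                     "2020-03-22", "2020-03-23", "2020-03-24")),
  County = rep(c("Philadelphia", "Baltimore"), each = 3),
  State  = rep(c("PA", "MD"), each = 3),
  Cases  = c(500, 892, 1502, 12, 152, 348),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one minimum per group instead of one per row
days_since_first <- function(x) {
  # Build a single grouping key from county and state
  key <- paste(x$County, x$State, sep = "|")
  # Earliest date per group (as days since the epoch), named by key
  first_by_group <- tapply(as.numeric(x$Date), key, min)
  # Look up each row's group minimum and subtract it
  as.integer(as.numeric(x$Date) - first_by_group[key])
}

df$days_since_first_case <- days_since_first(df)
```

The minimum is found once per group and the per-row work is a single lookup, so no row ever triggers a rescan of the whole data frame.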
Or you could try the data.table solution below.
library(data.table)
library(lubridate)
dt <- fread('Date | County | State | Cases | Days since first case
2020-03-14 | Philadelphia | PA | 500 | 0
2020-03-15 | Philadelphia | PA | 892 | 1
2020-03-16 | Philadelphia | PA | 1502 | 2
2020-03-22 | Baltimore | MD | 12 | 0
2020-03-23 | Baltimore | MD | 152 | 1
2020-03-24 | Baltimore | MD | 348 | 2')
dt[,Date:=ymd(Date)]
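# Assumes the row with the fewest cases is each county's earliest; head(..., 1) guards against ties
dt[,first_case_date:=head(Date[which(Cases==min(Cases))],1),by=.(County, State)]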
dt
#> Date County State Cases Days since first case first_case_date
#> 1: 2020-03-14 Philadelphia PA 500 0 2020-03-14
#> 2: 2020-03-15 Philadelphia PA 892 1 2020-03-14
#> 3: 2020-03-16 Philadelphia PA 1502 2 2020-03-14
#> 4: 2020-03-22 Baltimore MD 12 0 2020-03-22
#> 5: 2020-03-23 Baltimore MD 152 1 2020-03-22
#> 6: 2020-03-24 Baltimore MD 348 2 2020-03-22
dt[,Days_since_first_case:= Date-first_case_date]
dt
#> Date County State Cases Days since first case first_case_date
#> 1: 2020-03-14 Philadelphia PA 500 0 2020-03-14
#> 2: 2020-03-15 Philadelphia PA 892 1 2020-03-14
#> 3: 2020-03-16 Philadelphia PA 1502 2 2020-03-14
#> 4: 2020-03-22 Baltimore MD 12 0 2020-03-22
#> 5: 2020-03-23 Baltimore MD 152 1 2020-03-22
#> 6: 2020-03-24 Baltimore MD 348 2 2020-03-22
#> Days_since_first_case
#> 1: 0 days
#> 2: 1 days
#> 3: 2 days
#> 4: 0 days
#> 5: 1 days
#> 6: 2 days
Created on 2020-04-19 by the reprex package (v0.3.0)
I'm not sure about performance, because the test below still has only 2 groups. You can test it on your real dataset.
library(data.table)
library(lubridate)
library(microbenchmark)
dt <- fread('Date | County | State | Cases | Days since first case
2020-03-14 | Philadelphia | PA | 500 | 0
2020-03-15 | Philadelphia | PA | 892 | 1
2020-03-16 | Philadelphia | PA | 1502 | 2
2020-03-22 | Baltimore | MD | 12 | 0
2020-03-23 | Baltimore | MD | 152 | 1
2020-03-24 | Baltimore | MD | 348 | 2')
dt <- rbindlist(replicate(10000,dt,simplify = FALSE)) #60k records
dt[,Date:=ymd(Date)]
#key line for result
microbenchmark(dt[,first_case_date:=head(Date[which(Cases==min(Cases))],1),by=.(County)])
#> Unit: milliseconds
#>
#> expr: dt[, `:=`(first_case_date, head(Date[which(Cases == min(Cases))],1)), by = .(County)]
#> min lq mean median uq max neval
#> 1.6829 1.7602 2.015732 1.8329 2.1797 4.3841 100
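To get a timing closer to real data, one option is to benchmark on synthetic data with many county/state groups. The sketch below is only an illustration: the group count (3,000), the 20 days per group, and the generated county names are all made up, and it uses base R's `system.time` and `ave` rather than the packages above so that it runs without extra dependencies.

```r
set.seed(1)            # reproducible synthetic data
n_groups <- 3000       # assumed number of counties
days_per_group <- 20   # assumed rows per county

big <- data.frame(
  County = rep(paste0("county_", seq_len(n_groups)), each = days_per_group),
  State  = rep(sample(state.abb, n_groups, replace = TRUE), each = days_per_group),
  stringsAsFactors = FALSE
)

# Each group starts on a random day and then has consecutive daily rows
start <- as.Date("2020-03-01") + sample(0:30, n_groups, replace = TRUE)
big$Date <- rep(start, each = days_per_group) +
  rep(0:(days_per_group - 1), times = n_groups)

# A single pasted key avoids creating empty County x State combinations
key <- paste(big$County, big$State, sep = "|")
d <- as.numeric(big$Date)

# Time the grouped minimum and the subtraction on 60k rows
timing <- system.time(
  big$days_since_first_case <- as.integer(d - ave(d, key, FUN = min))
)
print(timing)
```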
Consider ave to compute the minimum by group and take the difference:
df['days_since_first_case'] <- with(df, as.integer(Date - ave(Date, County, State, FUN=min)))
Or, run aggregate + merge, then take the difference:
df <- within(merge(df, aggregate(cbind(Min_Date=Date) ~ County + State, df, FUN=min),
by = c("County", "State")), {
days_since_first_case <- as.integer(Date - Min_Date)
rm(Min_Date)
})
We can subtract the minimum date for each County and State from the current date.
library(dplyr)
df %>%
mutate(Date = as.Date(Date)) %>%
group_by(County, State) %>%
mutate(Days_since_first_case = as.integer(Date - min(Date)))
# Date County State Cases Days_since_first_case
# <date> <chr> <chr> <int> <int>
#1 2020-03-14 Philadelphia PA 500 0
#2 2020-03-15 Philadelphia PA 892 1
#3 2020-03-16 Philadelphia PA 1502 2
#4 2020-03-22 Baltimore MD 12 0
#5 2020-03-23 Baltimore MD 152 1
#6 2020-03-24 Baltimore MD 348 2
If you have a record for every day, you can also count the row number from the first date.
df %>%
mutate(Date = as.Date(Date)) %>%
arrange(County, State, Date) %>%
group_by(County, State) %>%
mutate(Days_since_first_case = row_number() - 1)
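The "record for every day" condition matters: if a county skips a day, the row count and the date difference diverge. A small illustration with made-up data follows; the `gappy` data frame and its values are hypothetical, and base R's `ave` stands in for the dplyr pipeline.

```r
# Hypothetical county with no record on 2020-03-16
gappy <- data.frame(
  Date   = as.Date(c("2020-03-14", "2020-03-15", "2020-03-17")),
  County = "Philadelphia",
  State  = "PA"
)

d   <- as.numeric(gappy$Date)
key <- paste(gappy$County, gappy$State, sep = "|")

# Difference from the group's first date
by_date <- as.integer(d - ave(d, key, FUN = min))
# Row position within the group minus one (rows are already sorted by date)
by_row <- as.integer(ave(d, key, FUN = seq_along)) - 1L

by_date  # 0 1 3
by_row   # 0 1 2
```

When gaps are possible, the date-difference version is the safer choice.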