Vectorizing calculation of difference from lowest date value in R

Question

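I'm working with a data frame that has 4 features: County, State, #COVID cases, and date. I want to add a column that gives the number of days since the lowest date value for that county. I found a way to do it, but it needs a for loop, and with more than 60k rows it takes far too long to run. I can't work out whether, and how, this can be computed in a vectorized way so that it finishes in a reasonable time.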

daysSinceFirstCase <- function (x) {
    # create vector the length of the row count 
    vals <- 1:nrow(x)

    # for each row 
    for(i in 1:nrow(x)) {
        row <- x[i, ]
        # get occurrences of that county and state
        countyCases <- x[x$county == row$county & x$state == row$state,]

        # get first date
        firstDate <- countyCases[order(countyCases$date),]$date[1]

        #calculate difference
        diff <- as.integer(row$date - firstDate)

        #store difference
        vals[i] <- diff 
        print(i)
    }
    return(vals)
}
df['days_since_first_case'] <- daysSinceFirstCase(df)

Edit: here is a sample of my data and the column I want to create.

Date       |  County      | State | Cases | Days since first case 
2020-03-14 | Philadelphia | PA    | 500   | 0
2020-03-15 | Philadelphia | PA    | 892   | 1
2020-03-16 | Philadelphia | PA    | 1502  | 2
2020-03-22 | Baltimore    | MD    | 12    | 0
2020-03-23 | Baltimore    | MD    | 152   | 1
2020-03-24 | Baltimore    | MD    | 348   | 2

Answer 1

To answer your question: your code is not vectorized.

# get first date
firstDate <- countyCases[order(countyCases$date),]$date[1]

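I think you are running this line 60k times, once per row, and each run subsets and sorts the whole data frame. One potential improvement is to run it once per County + State group instead of once per row.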
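A minimal base R sketch of that per-group idea, assuming the columns are named Date, County and State as in the sample table and that Date is already of class Date. The toy data frame and the variable names groups and days are only for illustration:

```r
# Toy data mirroring the sample in the question
df <- data.frame(
  Date   = as.Date(c("2020-03-14", "2020-03-15", "2020-03-16",
                     "2020-03-22", "2020-03-23", "2020-03-24")),
  County = c("Philadelphia", "Philadelphia", "Philadelphia",
             "Baltimore", "Baltimore", "Baltimore"),
  State  = c("PA", "PA", "PA", "MD", "MD", "MD"),
  Cases  = c(500L, 892L, 1502L, 12L, 152L, 348L)
)

# Row indices for every County + State combination that actually occurs
groups <- split(seq_len(nrow(df)), list(df$County, df$State), drop = TRUE)

# One iteration per group; the subtraction inside is vectorized over the rows
days <- integer(nrow(df))
for (idx in groups) {
  days[idx] <- as.integer(df$Date[idx] - min(df$Date[idx]))
}

df$days_since_first_case <- days
```

The loop now runs once per county rather than once per row, so the cost scales with the number of groups instead of the number of records.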

Alternatively, you could try the data.table solution below.

library(data.table)
library(lubridate)

dt <- fread('Date       |  County      | State | Cases | Days since first case 
2020-03-14 | Philadelphia | PA    | 500   | 0
2020-03-15 | Philadelphia | PA    | 892   | 1
2020-03-16 | Philadelphia | PA    | 1502  | 2
2020-03-22 | Baltimore    | MD    | 12    | 0
2020-03-23 | Baltimore    | MD    | 152   | 1
2020-03-24 | Baltimore    | MD    | 348   | 2')


dt[,Date:=ymd(Date)]

dt[,first_case_date:=Date[which(Cases==min(Cases))],by=.(County)]

dt
#>          Date       County State Cases Days since first case first_case_date
#> 1: 2020-03-14 Philadelphia    PA   500                     0      2020-03-14
#> 2: 2020-03-15 Philadelphia    PA   892                     1      2020-03-14
#> 3: 2020-03-16 Philadelphia    PA  1502                     2      2020-03-14
#> 4: 2020-03-22    Baltimore    MD    12                     0      2020-03-22
#> 5: 2020-03-23    Baltimore    MD   152                     1      2020-03-22
#> 6: 2020-03-24    Baltimore    MD   348                     2      2020-03-22

dt[,Days_since_first_case:= Date-first_case_date]

dt
#>          Date       County State Cases Days since first case first_case_date
#> 1: 2020-03-14 Philadelphia    PA   500                     0      2020-03-14
#> 2: 2020-03-15 Philadelphia    PA   892                     1      2020-03-14
#> 3: 2020-03-16 Philadelphia    PA  1502                     2      2020-03-14
#> 4: 2020-03-22    Baltimore    MD    12                     0      2020-03-22
#> 5: 2020-03-23    Baltimore    MD   152                     1      2020-03-22
#> 6: 2020-03-24    Baltimore    MD   348                     2      2020-03-22
#>    Days_since_first_case
#> 1:                0 days
#> 2:                1 days
#> 3:                2 days
#> 4:                0 days
#> 5:                1 days
#> 6:                2 days

Created on 2020-04-19 by the reprex package (v0.3.0)

I'm not sure about the performance, because the test below still contains only 2 groups. You can test it on your real dataset.

library(data.table)
library(lubridate)
library(microbenchmark)

dt <- fread('Date       |  County      | State | Cases | Days since first case 
2020-03-14 | Philadelphia | PA    | 500   | 0
2020-03-15 | Philadelphia | PA    | 892   | 1
2020-03-16 | Philadelphia | PA    | 1502  | 2
2020-03-22 | Baltimore    | MD    | 12    | 0
2020-03-23 | Baltimore    | MD    | 152   | 1
2020-03-24 | Baltimore    | MD    | 348   | 2')

dt <- rbindlist(replicate(10000,dt,simplify = FALSE)) #60k records


dt[,Date:=ymd(Date)]

#key line for result
microbenchmark(dt[,first_case_date:=head(Date[which(Cases==min(Cases))],1),by=.(County)])
#> Unit: milliseconds
#>                                                                                      
#> expr: dt[, `:=`(first_case_date, head(Date[which(Cases == min(Cases))],1)), by = .(County)]
#>     min     lq     mean median     uq    max neval
#>  1.6829 1.7602 2.015732 1.8329 2.1797 4.3841   100
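Note that the expression above finds the first date through the smallest Cases value, which matches the earliest date only while counts never decrease within a county, and it groups by County alone. A hedged sketch that keys on the date itself and groups by both County and State, assuming Date is already a Date column (the toy table is only for illustration):

```r
library(data.table)

# Toy data mirroring the sample in the question
dt <- data.table(
  Date   = as.Date(c("2020-03-14", "2020-03-15", "2020-03-16",
                     "2020-03-22", "2020-03-23", "2020-03-24")),
  County = c("Philadelphia", "Philadelphia", "Philadelphia",
             "Baltimore", "Baltimore", "Baltimore"),
  State  = c("PA", "PA", "PA", "MD", "MD", "MD"),
  Cases  = c(500L, 892L, 1502L, 12L, 152L, 348L)
)

# Per group: subtract the earliest Date, then drop the difftime class
dt[, days_since_first_case := as.integer(Date - min(Date)),
   by = .(County, State)]
```

Grouping on State as well keeps counties that share a name in different states apart.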

Answer 2

Consider ave to compute the minimum date by group and then take the difference:

df['days_since_first_case'] <- with(df, as.integer(Date - ave(Date, County, State, FUN=min)))

Alternatively, run aggregate + merge and then take the difference:

df <- within(merge(df, aggregate(cbind(Min_Date=Date) ~ County + State, df, FUN=min),
                   by = c("County", "State")), {
                 days_since_first_case <- as.integer(Date - Min_Date)
                 rm(Min_Date)
             })

Answer 3

We can subtract the minimum Date for each County and State from the current Date.

library(dplyr)

df %>%
  mutate(Date = as.Date(Date)) %>%
  group_by(County, State) %>%
  mutate(Days_since_first_case = as.integer(Date - min(Date)))


#  Date       County       State Cases Days_since_first_case
#  <date>     <chr>        <chr> <int>                 <int>
#1 2020-03-14 Philadelphia PA      500                     0
#2 2020-03-15 Philadelphia PA      892                     1
#3 2020-03-16 Philadelphia PA     1502                     2
#4 2020-03-22 Baltimore    MD       12                     0
#5 2020-03-23 Baltimore    MD      152                     1
#6 2020-03-24 Baltimore    MD      348                     2

If every group has exactly one record per day, with no gaps, you can also use the row number counted from the first date.

df %>%
  mutate(Date = as.Date(Date)) %>%
  arrange(County, State, Date) %>%
  group_by(County, State) %>%
  mutate(Days_since_first_case = row_number() - 1)
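The row-number shortcut holds only when no day is missing. A small base R sketch of how the two approaches diverge once there is a gap; the three-row data frame and the names key, by_date and by_row are hypothetical, chosen just to show the effect:

```r
# One county with 2020-03-16 missing
df <- data.frame(
  Date   = as.Date(c("2020-03-14", "2020-03-15", "2020-03-17")),
  County = "Philadelphia",
  State  = "PA"
)
df <- df[order(df$County, df$State, df$Date), ]

# Single grouping key for County + State
key <- paste(df$County, df$State, sep = "|")

# Date-based: days elapsed since the group's earliest date
by_date <- as.integer(as.numeric(df$Date) - ave(as.numeric(df$Date), key, FUN = min))

# Row-based: position within the group, counted from zero
by_row <- ave(seq_len(nrow(df)), key, FUN = seq_along) - 1L

by_date  # 0 1 3
by_row   # 0 1 2
```

When the data may have missing days, the date subtraction is the safer choice.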

Tags: r, vectorization, dataframe