Using dplyr and lubridate to combine count and group by month and year

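I have a data frame in which each row represents a single event that occurred in a city. The data frame gives the city name and the date of the event, as shown below: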

df <- data.frame(city = c("Seattle", "Seattle", "Seattle", "Seattle", "Seattle", "NYC", "NYC", "NYC", "Chicago",
                          "Chicago", "Chicago", "Chicago", "Chicago"),
                 date_of_event = c("01/13/2011", "01/17/2011", "03/15/2011", "05/21/2011", "05/23/2011",
                                   "01/20/2011", "01/22/2011", "03/23/2011", "01/18/2011", "02/24/2011",
                                   "02/26/2011", "04/30/2011", "06/18/2011"),
                 stringsAsFactors = FALSE)

df$date_of_event <- as.Date(df$date_of_event, "%m/%d/%Y")

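The above is only an example; my actual data is a CSV with thousands of rows, many cities, many dates, and so on. What I want is a new data frame with one row for each city and each month/year represented in the dataset, plus a count column giving the number of events recorded in the original data frame for that city in that month. The second data frame would look like this: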

df2 <- data.frame(city = c("Seattle", "Seattle", "Seattle", "Seattle", "Seattle", "Seattle", "NYC", "NYC", "NYC", "NYC",
                           "NYC", "NYC", "Chicago", "Chicago", "Chicago", "Chicago", "Chicago", "Chicago"),
                  month_year = c("01/01/2011", "02/01/2011", "03/01/2011", "04/01/2011", "05/01/2011", "06/01/2011",
                                 "01/01/2011", "02/01/2011", "03/01/2011", "04/01/2011", "05/01/2011", "06/01/2011",
                                 "01/01/2011", "02/01/2011", "03/01/2011", "04/01/2011", "05/01/2011", "06/01/2011"),
                  count = c(2, 0, 1, 0, 2, 0, 2, 0, 1, 0, 0, 0, 1, 2, 0, 1, 0, 1),
                  stringsAsFactors = FALSE)

df2$month_year <- as.Date(df2$month_year, "%m/%d/%Y")

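I know you can use count from dplyr, and lubridate to round dates down to the first day of each month, but I have tried and failed to get the grouping and counting right to produce the second data frame I want.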

You can try this:

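library(tidyverse)
library(lubridate)

# round every event date down to the first day of its month
df3 <- df %>% mutate(new_date = floor_date(date_of_event, "month"))
# drop the original date column and cross-tabulate city by month;
# table() also reports a zero for city/month pairs that have no events
tt <- as.data.frame(table(df3[-2]))
# sort by city (descending) and then by month
tt[order(desc(tt$city), tt$new_date),]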

      city   new_date Freq
   Seattle 2011-01-01    2
   Seattle 2011-02-01    0
   Seattle 2011-03-01    1
   Seattle 2011-04-01    0
   Seattle 2011-05-01    2
   Seattle 2011-06-01    0
       NYC 2011-01-01    2
       NYC 2011-02-01    0
       NYC 2011-03-01    1
       NYC 2011-04-01    0
       NYC 2011-05-01    0
       NYC 2011-06-01    0
   Chicago 2011-01-01    1
   Chicago 2011-02-01    2
   Chicago 2011-03-01    0
   Chicago 2011-04-01    1
   Chicago 2011-05-01    0
   Chicago 2011-06-01    1
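
As an alternative that stays within dplyr and tidyr, the same result can probably be reached with count() followed by complete(). The sketch below is not part of the answer above; it rebuilds the question's df so that it runs on its own, and the names monthly, month_year and count are chosen here to mirror df2. It assumes that the months to fill in are those between the earliest and latest month present in the data.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# The question's example data, rebuilt so the sketch is self-contained
df <- data.frame(
  city = rep(c("Seattle", "NYC", "Chicago"), times = c(5, 3, 5)),
  date_of_event = as.Date(
    c("01/13/2011", "01/17/2011", "03/15/2011", "05/21/2011", "05/23/2011",
      "01/20/2011", "01/22/2011", "03/23/2011", "01/18/2011", "02/24/2011",
      "02/26/2011", "04/30/2011", "06/18/2011"),
    format = "%m/%d/%Y"),
  stringsAsFactors = FALSE
)

monthly <- df %>%
  # round each date down to the first day of its month
  mutate(month_year = floor_date(date_of_event, "month")) %>%
  # one row per city/month that actually occurs, with its number of events
  count(city, month_year, name = "count") %>%
  # add the missing city/month combinations and give them a count of zero
  complete(city,
           month_year = seq(min(month_year), max(month_year), by = "month"),
           fill = list(count = 0L))

monthly
```

Note that complete() returns the rows sorted by city and then by month, so the cities appear in alphabetical order rather than in the order used in df2.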

To include an extended period with zero counts, you can try this:

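# assign a name to the output obtained previously
df4 <- tt[order(desc(tt$city), tt$new_date),]

a <- mdy("01/01/11") # starting period
b <- a + months(0:92)  # period sequence: 93 consecutive month starts

# every city/month combination in the extended period
# (new_date is a factor here because table() made it a factor in df4)
df5 <- expand.grid(city = c("Chicago", "Seattle", "NYC"), new_date = as.factor(b))

# keep only the combinations that are not already in df4
df6 <- setdiff(df5, df4[-3])
df6$Freq <- 0 # assign zero count

# append the zero-count rows to the observed counts
df7 <- rbind(df4, df6)

# sort by city and then by month
df8 <- df7[order(df7$city, df7$new_date), ]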
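
A shorter route to the same kind of result may be to let table() create the zero rows itself, by turning the month into a factor whose levels cover the whole period. This is only a sketch under assumptions: events is a small hypothetical sample in the same shape as df, the window of January to December 2011 is arbitrary, and all_months, month_factor and counts are names introduced here for illustration.

```r
library(lubridate)

# Hypothetical sample in the same shape as the question's df
events <- data.frame(
  city = c("Seattle", "Seattle", "NYC", "Chicago"),
  date_of_event = as.Date(c("2011-01-13", "2011-01-17",
                            "2011-03-23", "2011-02-24")),
  stringsAsFactors = FALSE
)

# Every month start in the reporting window (assumed here: all of 2011)
all_months <- seq(as.Date("2011-01-01"), as.Date("2011-12-01"), by = "month")

# Encode each event's month as a factor whose levels span the whole window,
# so that months without any event still exist as (empty) levels
month_start <- floor_date(events$date_of_event, "month")
month_factor <- factor(as.character(month_start),
                       levels = as.character(all_months))

# table() counts every city/level combination, including the empty ones
counts <- as.data.frame(table(city = events$city, new_date = month_factor),
                        stringsAsFactors = FALSE)
counts$new_date <- as.Date(counts$new_date)
counts <- counts[order(counts$city, counts$new_date), ]

head(counts)
```

Because the zero counts come from the factor levels, no setdiff() or rbind() step is needed, and new_date ends up as a Date column rather than a factor.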