Number of firms per year using dplyr or data.table

Suppose I have the following data frame:

df <- data.frame(City = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
                 YearFrom = c("2001", "2003", "2002", "2006", "2008", "2004", "2005", "2005", "2002"),
                 YearTo = c(NA, "2005", NA, NA, "2009", NA, "2008", NA, NA))

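Here YearFrom is the year in which, for example, a firm was founded, and YearTo is the year in which it was deregistered. If YearTo is NA, the firm is still active.

I would like to count the number of firms for each year.

The resulting table should look like this: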

City  Year  Count
NY    2001  1
NY    2002  2
NY    2003  3
NY    2004  3
NY    2005  2
NY    2006  3
NY    2007  3
NY    2008  4
NY    2009  3
LA    2001  0
LA    2002  1
LA    2003  1
LA    2004  2
LA    2005  4
LA    2006  4
LA    2007  4
LA    2008  2
LA    2009  2

I would like to solve this with the dplyr or data.table package, but I can't figure out how.

This solution uses dplyr and tidyr.

library(dplyr)
library(tidyr)

df %>%
  # Change YearFrom and YearTo to numeric
  mutate(YearFrom = as.numeric(as.character(YearFrom)), 
         YearTo = as.numeric(as.character(YearTo))) %>%
  # Replace NA with 2017 in YearTo
  mutate(YearTo = ifelse(is.na(YearTo), 2017, YearTo)) %>%
  # Subtract 1 from YearTo so the year of cancellation is excluded
  mutate(YearTo = YearTo - 1) %>%
  # Group by row
  rowwise() %>%
  # Create a tbl for each row, expand the Year column based on YearFrom and YearTo
  do(tibble(City = .$City, Year = seq(.$YearFrom, .$YearTo, by = 1))) %>%
  ungroup() %>%
  # Count the number of each City and Year
  count(City, Year) %>%
  # Rename the column n to Count
  rename(Count = n) %>%
  # Spread the data frame to expose the implicit missing value for LA, 2001
  spread(Year, Count) %>%
  # Gather the data frame again so LA, 2001 becomes an explicit NA row;
  # convert = TRUE turns the Year key back into an integer
  gather(Year, Count, - City, convert = TRUE) %>%
  # Replace NA with 0 in Count
  mutate(Count = ifelse(is.na(Count), 0L, Count)) %>%
  # Arrange the data 
  arrange(desc(City), Year) %>%
  # Filter the data until 2009
  filter(Year <= 2009)
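
The expand-then-count idea behind this pipeline can also be checked without any packages. The following is only a base R sketch under stated assumptions, not part of the answer above: `last_year` is a cutoff chosen here for illustration, and `last_active`, `years`, `long` and `counts` are names introduced for this example. It relies on `table()` tabulating every City/Year combination, which plays the role of the spread/gather step and makes LA, 2001 appear with a count of 0.

```r
# Same data as in the question (years stored as strings)
df <- data.frame(
  City = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
  YearFrom = c("2001", "2003", "2002", "2006", "2008", "2004", "2005", "2005", "2002"),
  YearTo = c(NA, "2005", NA, NA, "2009", NA, "2008", NA, NA),
  stringsAsFactors = FALSE
)

last_year <- 2009L  # assumption: the last year we want to report

from <- as.integer(df$YearFrom)
to <- as.integer(df$YearTo)

# Last active year per firm: the year before cancellation,
# or last_year if the firm is still active
last_active <- ifelse(is.na(to), last_year, pmin(to - 1L, last_year))

# One vector of active years per firm, then one long data frame
years <- Map(seq, from, last_active)
long <- data.frame(City = rep(df$City, lengths(years)), Year = unlist(years))

# table() covers every City x Year combination, including empty ones
counts <- as.data.frame(table(City = long$City, Year = long$Year),
                        responseName = "Count", stringsAsFactors = FALSE)
counts$Year <- as.integer(counts$Year)
counts <- counts[order(counts$City, counts$Year), ]
counts
```

Because the cross-tabulation is complete by construction, no separate zero-filling step is needed in this variant.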

Here is an answer using data.table. The data preparation is at the bottom.

# get list of businesses, one obs per year of operation
cityList <- lapply(seq_len(nrow(df)),
              function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))])

# combine to a single data.table
dfNew <- rbindlist(cityList)

# get counts
dfNew <- dfNew[, .(Count=.N), by=.(City, Year)]

Written as a single statement, this is

# get the counts
rbindlist(lapply(seq_len(nrow(df)),
          function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))]))[, .(Count=.N),
  by=.(City, Year)]

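Here, lapply iterates over each row and builds a data.table with the repeated City value as one column and the years of operation as the second column. YearTo is decremented so that the closing year is not included. Note that during data preparation, missing YearTo values are set to 2018, so 2017 is included in the counts.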

lapply returns a list of data.tables, which rbindlist combines into a single data.table. This data.table is then aggregated by city-year pair, and the counts are built with .N.
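
To see the three steps in isolation, here is a minimal sketch on a two-row toy table. The table `toy` and the names `pieces`, `long` and `counts` are made up for this illustration and are not part of the answer; the calls themselves (`lapply`, `rbindlist`, `.N`) are the same ones used above.

```r
library(data.table)

# Hypothetical toy data: two firms in one city, years already integers
toy <- data.table(City = c("NY", "NY"),
                  YearFrom = c(2001L, 2002L),
                  YearTo = c(2003L, 2004L))

# Step 1: one small data.table per firm, one row per year of operation
# (YearTo - 1L leaves out the closing year)
pieces <- lapply(seq_len(nrow(toy)),
                 function(i) toy[i, .(City, Year = seq(YearFrom, YearTo - 1L))])

# Step 2: stack the pieces into a single long data.table
long <- rbindlist(pieces)

# Step 3: count rows per City-Year pair with .N
counts <- long[, .(Count = .N), by = .(City, Year)]
counts
```

The first firm contributes 2001 and 2002, the second 2002 and 2003, so only 2002 is counted twice.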

Both versions return

    City Year Count
 1:   NY 2001     1
 2:   NY 2002     2
 3:   NY 2003     3
 4:   NY 2004     3
 5:   NY 2005     2
 6:   NY 2006     3
 7:   NY 2007     3
  ...
26:   LA 2012     3
27:   LA 2013     3
28:   LA 2014     3
29:   LA 2015     3
30:   LA 2016     3
31:   LA 2017     3
32:   LA 2002     1
33:   LA 2003     1

Data

library(data.table)
setDT(df)
# convert string years to integers
df[, grep("Year", names(df), value=TRUE) := 
   lapply(.SD, function(x) as.integer(as.character(x))), .SDcols=grep("Year", names(df))]
# replace NA values with 2018 (to include 2017 in count)
df[is.na(YearTo), YearTo := 2018]

A shorter tidyverse solution.

library(dplyr)
library(tidyr)
library(purrr)

# First, some data prep
df <- mutate(df,
    YearFrom = as.numeric(as.character(YearFrom)),                     # Fix year coding
    YearTo = as.numeric(as.character(YearTo)),
    YearTo = coalesce(YearTo, max(c(YearFrom, YearTo), na.rm = TRUE))) # Replace NA with max

df %>% 
  mutate(Years = map2(YearFrom, YearTo - 1, `:`)) %>%          # Find all years
  unnest(Years) %>%                                            # Spread over rows
  count(Years, City) %>%                                       # Count them
  complete(City, Years, fill = list(n = 0))                    # Add in zeros, if needed

First, clean up the data...

library(data.table)

# year() comes from data.table, so the package must be loaded first
curr_year = as.integer(year(Sys.Date()))

setDT(df)
df[, YearTo := as.integer(as.character(YearTo)) ]
df[, YearFrom := as.integer(as.character(YearFrom)) ]
df[, quasiYearTo := YearTo ]
df[is.na(YearTo), quasiYearTo := curr_year ]

Then, a non-equi join:

df[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE), 
  on=.(City, YearFrom <= Year, quasiYearTo > Year), allow.cartesian = TRUE, 
  .N
, by=.EACHI][, .(City, Year = YearFrom, N)]

    City Year N
 1:   LA 2001 0
 2:   LA 2002 1
 3:   LA 2003 1
 4:   LA 2004 2
 5:   LA 2005 4
 6:   LA 2006 4
 7:   LA 2007 4
 8:   LA 2008 3
 9:   LA 2009 3
10:   NY 2001 1
11:   NY 2002 2
12:   NY 2003 3
13:   NY 2004 3
14:   NY 2005 2
15:   NY 2006 3
16:   NY 2007 3
17:   NY 2008 4
18:   NY 2009 3
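
The join condition amounts to YearFrom <= Year < quasiYearTo, evaluated for every City/Year pair. As a cross-check, here is a package-free sketch of that same condition in base R. It is an illustration only: `quasi_to` and `grid` are names introduced here, and 2010 stands in for "still active" as an assumption (any value above the last reported year gives the same counts).

```r
# Same data as in the question (years stored as strings)
df <- data.frame(
  City = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
  YearFrom = c("2001", "2003", "2002", "2006", "2008", "2004", "2005", "2005", "2002"),
  YearTo = c(NA, "2005", NA, NA, "2009", NA, "2008", NA, NA),
  stringsAsFactors = FALSE
)

from <- as.integer(df$YearFrom)
to <- as.integer(df$YearTo)

# Assumption: a still-active firm "closes" after the last reported year
quasi_to <- ifelse(is.na(to), 2010L, to)

# Every City/Year pair we want a count for
grid <- expand.grid(City = c("LA", "NY"), Year = 2001:2009,
                    stringsAsFactors = FALSE)

# For each pair, count the firms with YearFrom <= Year < quasi_to
grid$N <- mapply(function(city, year) {
  sum(df$City == city & from <= year & quasi_to > year)
}, grid$City, grid$Year, USE.NAMES = FALSE)

grid <- grid[order(grid$City, grid$Year), ]
grid
```

This does the work row by row and will be far slower than the non-equi join on large data; it is meant only to make the matching rule explicit.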