Inserting NA for missing observations in a time series for a correct line plot

I have time series for different groups in which some values are missing, for example:

library(tidyverse)

df <- tribble(
  ~year, ~country, ~variable, 
  #--|--|----
  2003, "USA", 44,
  2004, "USA", 40,
  2005, "USA", 30,
  # 2006 for USA is missing!
  # 2007 for USA is missing!
  # 2008 for USA is missing!
  2009, "USA", 39,
  2010, "USA", 55,
  2011, "USA", 53,
  2012, "USA", 71,
  # 2003 for FRA is missing!
  # 2004 for FRA is missing!
  2005, "FRA", 10,
  2006, "FRA", 8,
  2007, "FRA", 13,
  2008, "FRA", 12,
  2009, "FRA", 18,
  2010, "FRA", 39
  # 2011 for FRA is missing!
  # 2012 for FRA is missing!
)

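When I plot my series, geom_line() connects the points even across years for which I have no observation:

ggplot(df, aes(year, variable, color = country)) +
  geom_line()

"FRA" looks fine because its missing data are at the beginning and the end, but for "USA" I don't want the line to be connected across 2006 to 2008.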

What I want is the following:

df <- tribble(
  ~year, ~country, ~variable, 
  #--|--|----
  2003, "USA", 44,
  2004, "USA", 40,
  2005, "USA", 30,
  2006, "USA", NA, # explicitly missing!
  2007, "USA", NA, # explicitly missing!
  2008, "USA", NA, # explicitly missing!
  2009, "USA", 39,
  2010, "USA", 55,
  2011, "USA", 53,
  2012, "USA", 71,
  2003, "FRA", NA, # explicitly missing!
  2004, "FRA", NA, # explicitly missing!
  2005, "FRA", 10,
  2006, "FRA", 8,
  2007, "FRA", 13,
  2008, "FRA", 12,
  2009, "FRA", 18,
  2010, "FRA", 39,
  2011, "FRA", NA, # explicitly missing!
  2012, "FRA", NA # explicitly missing!
)

ggplot(df, aes(year, variable, color = country)) +
  geom_line()

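This gives a plot in which the USA line is broken from 2006 to 2008.

My real-life data set has many groups and dates, so manually inserting NAs in the right places is not an option.

I tried merging with the full list of dates, but that did not solve the problem: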

df %>% 
  right_join(tibble(year = seq(2003, 2012)))

Any ideas?

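You can use expand.grid to create the missing rows in the data frame automatically:

# Build every year/country combination, then join the observed values onto it;
# combinations with no observation get NA in `variable`.
df2 <- expand.grid(year = unique(df$year), country = unique(df$country),
                   stringsAsFactors = FALSE) %>%
  left_join(df, by = c("year", "country"))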

ggplot(df2, aes(year, variable, color = country)) +
  geom_line()

df2 will look like this:

   year country variable
1  2003     USA       44
2  2004     USA       40
3  2005     USA       30
4  2009     USA       39
5  2010     USA       55
6  2011     USA       53
7  2012     USA       71
8  2006     USA       NA
9  2007     USA       NA
10 2008     USA       NA
11 2003     FRA       NA
12 2004     FRA       NA
13 2005     FRA       10
14 2009     FRA       18
15 2010     FRA       39
16 2011     FRA       NA
17 2012     FRA       NA
18 2006     FRA        8
19 2007     FRA       13
20 2008     FRA       12

In the resulting plot, the USA line is no longer connected across 2006 to 2008.

Hope this helps!
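As an alternative to the expand.grid approach, the same completion can be sketched with tidyr::complete(). This is not part of the original answer; it assumes the tidyr package is installed and rebuilds the question's data as a plain data frame so the block is self-contained. Using full_seq() instead of unique() means a year that is missing from every country would still be added.

```r
library(tidyr)

# The question's data, rebuilt as a plain data frame.
df <- data.frame(
  year     = c(2003:2005, 2009:2012, 2005:2010),
  country  = c(rep("USA", 7), rep("FRA", 6)),
  variable = c(44, 40, 30, 39, 55, 53, 71, 10, 8, 13, 12, 18, 39)
)

# complete() adds one row per country/year combination.
# full_seq(year, period = 1) spans min(year) to max(year) in steps of 1,
# so the year range does not depend on which years happen to be observed.
df_complete <- complete(df, country, year = full_seq(year, period = 1))
```

df_complete can then be passed to ggplot() exactly like df2 above.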

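The problem is not ggplot but your data. The solution is to do a merge before plotting. First create a data set that contains every combination of year and country.

For example:

all_yr <- expand.grid(year = 2000:2010, country = c("CountryA", "CountryB", "CountryZ"))

Then merge your real data set with this complete data set (all_yr). The merge should keep all years and countries contained in all_yr; those that are missing from real_data will be filled with NA.

For example:

merge(all_yr, real_data, by = c("year", "country"), all.x = TRUE)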
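The two steps above can be sketched end to end on the question's data using base R only. The names real_data, all_yr and merged are illustrative, and the year range is assumed to run from the smallest to the largest observed year.

```r
# The question's data as a plain data frame (base R only).
real_data <- data.frame(
  year     = c(2003:2005, 2009:2012, 2005:2010),
  country  = c(rep("USA", 7), rep("FRA", 6)),
  variable = c(44, 40, 30, 39, 55, 53, 71, 10, 8, 13, 12, 18, 39)
)

# Every year/country combination over the full observed range of years.
all_yr <- expand.grid(
  year    = seq(min(real_data$year), max(real_data$year)),
  country = unique(real_data$country),
  stringsAsFactors = FALSE
)

# Left join: keep every row of all_yr; unmatched rows get NA in `variable`.
merged <- merge(all_yr, real_data, by = c("year", "country"), all.x = TRUE)

# Sort for readability; geom_line() orders by x within each group anyway.
merged <- merged[order(merged$country, merged$year), ]
```

Plotting merged with geom_line() then leaves a gap wherever variable is NA.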

This works for me:

set.seed(357)
xy <- data.frame(year = c(2003:2005, 2009:2012, 2005:2010),
                 country = c(rep("USA", 7), rep("FR", 6)),
                 vrbl = rnorm(7+6))

sxy <- split(xy, f = xy$country)
mxy <- data.frame(year = 2003:2012)

out <- sapply(sxy, FUN = function(x, mxy) {
  out <- merge(x = mxy, y = x, all = TRUE)
  out$country <- unique(x$country)
  out
}, mxy = mxy, simplify = FALSE)
out <- do.call(rbind, out)

library(ggplot2)

ggplot(out, aes(x = year, y = vrbl, color = country)) +
  theme_bw() +
  geom_line()

The data frame out looks like this:

       year country        vrbl
FR.1   2003      FR          NA
FR.2   2004      FR          NA
FR.3   2005      FR  0.22703071
FR.4   2006      FR -0.46901506
FR.5   2007      FR  0.47652129
FR.6   2008      FR -0.91164798
FR.7   2009      FR -0.34177516
FR.8   2010      FR  0.54674134
FR.9   2011      FR          NA
FR.10  2012      FR          NA
USA.1  2003     USA -1.24111731
USA.2  2004     USA -0.58320499
USA.3  2005     USA  0.39474705
USA.4  2006     USA          NA
USA.5  2007     USA          NA
USA.6  2008     USA          NA
USA.7  2009     USA  1.50421107
USA.8  2010     USA  0.76679974
USA.9  2011     USA  0.31746044
USA.10 2012     USA -0.09997594