Inserting NA for missing observations in a time series for a correct line plot
I have time series for different groups in which some values are missing, for example:
library(tidyverse)
df <- tribble(
~year, ~country, ~variable,
#--|--|----
2003, "USA", 44,
2004, "USA", 40,
2005, "USA", 30,
# 2006 for USA is missing!
# 2007 for USA is missing!
# 2008 for USA is missing!
2009, "USA", 39,
2010, "USA", 55,
2011, "USA", 53,
2012, "USA", 71,
# 2003 for FRA is missing!
# 2004 for FRA is missing!
2005, "FRA", 10,
2006, "FRA", 8,
2007, "FRA", 13,
2008, "FRA", 12,
2009, "FRA", 18,
2010, "FRA", 39
# 2011 for FRA is missing!
# 2012 for FRA is missing!
)
When I plot my series, geom_line() connects the points even across years for which I have no observation:
ggplot(df, aes(year, variable, color = country)) +
geom_line()
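"FRA" looks fine because its missing data are at the beginning and the end, but for "USA" I do not want the line to connect across 2006 to 2008.
What I want is the following: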
df <- tribble(
~year, ~country, ~variable,
#--|--|----
2003, "USA", 44,
2004, "USA", 40,
2005, "USA", 30,
2006, "USA", NA, # explicitly missing!
2007, "USA", NA, # explicitly missing!
2008, "USA", NA, # explicitly missing!
2009, "USA", 39,
2010, "USA", 55,
2011, "USA", 53,
2012, "USA", 71,
2003, "FRA", NA, # explicitly missing!
2004, "FRA", NA, # explicitly missing!
2005, "FRA", 10,
2006, "FRA", 8,
2007, "FRA", 13,
2008, "FRA", 12,
2009, "FRA", 18,
2010, "FRA", 39,
2011, "FRA", NA, # explicitly missing!
2012, "FRA", NA # explicitly missing!
)
ggplot(df, aes(year, variable, color = country)) +
geom_line()
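This gives the plot I want, with a gap in the USA line between 2005 and 2009.
In my real-life dataset there are many groups and dates, so manually inserting NAs in the right places is not an option.
I tried merging with the full list of dates, but that did not solve the problem: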
df %>%
right_join(tibble(year = seq(2003, 2012)))
Any ideas?
You can use expand.grid to create the missing rows in the data frame automatically:
df2 = expand.grid(year=unique(df$year),country=unique(df$country)) %>% left_join(df)
ggplot(df2, aes(year, variable, color = country)) +
geom_line()
df2 will look like this:
year country variable
1 2003 USA 44
2 2004 USA 40
3 2005 USA 30
4 2009 USA 39
5 2010 USA 55
6 2011 USA 53
7 2012 USA 71
8 2006 USA NA
9 2007 USA NA
10 2008 USA NA
11 2003 FRA NA
12 2004 FRA NA
13 2005 FRA 10
14 2009 FRA 18
15 2010 FRA 39
16 2011 FRA NA
17 2012 FRA NA
18 2006 FRA 8
19 2007 FRA 13
20 2008 FRA 12
The resulting plot then leaves a gap for the missing USA years.
Hope this helps!
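As an alternative to expand.grid() plus left_join(), tidyr's complete() can do the same thing in one step. The following is only a minimal sketch: it assumes the question's data and a yearly spacing, and the names df_full and p are picked purely for illustration.

```r
library(tidyr)
library(ggplot2)

# Same data as in the question, with implicit gaps
df <- data.frame(
  year = as.numeric(c(2003:2005, 2009:2012, 2005:2010)),
  country = c(rep("USA", 7), rep("FRA", 6)),
  variable = c(44, 40, 30, 39, 55, 53, 71, 10, 8, 13, 12, 18, 39)
)

# full_seq() lists every year between the min and max (step of 1 is an assumption);
# complete() adds a row with NA for each year/country combination not in df
df_full <- complete(df, year = full_seq(year, period = 1), country)

# geom_line() leaves a gap wherever variable is NA
p <- ggplot(df_full, aes(year, variable, color = country)) +
  geom_line()
```

Unlike unique(df$year), full_seq() also covers a year that is missing in every group, which may matter in a larger dataset.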
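The problem is not ggplot but your data. The solution is to merge before plotting. Create a dataset that contains every combination of year and country.
For example:
all_yr <- expand.grid(year = 2000:2010, country = c("CountryA", "CountryB", "CountryZ"))
Then merge your real dataset with this complete dataset (all_yr). The merge should keep every year and country contained in all_yr; the combinations missing from real_data will be filled with NA.
For example:
merge(all_yr, real_data, by = c("year", "country"), all.x = TRUE)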
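To make this concrete, here is a rough base R sketch using the question's data. It also shows why joining on year alone (as attempted in the question) changes nothing. The names real_data, by_year and merged are placeholders chosen for this illustration.

```r
# Question's data, with implicit gaps
real_data <- data.frame(
  year = c(2003:2005, 2009:2012, 2005:2010),
  country = c(rep("USA", 7), rep("FRA", 6)),
  variable = c(44, 40, 30, 39, 55, 53, 71, 10, 8, 13, 12, 18, 39)
)

# Joining on year alone adds no rows: every year from 2003 to 2012
# already occurs for at least one country
by_year <- merge(real_data, data.frame(year = 2003:2012),
                 by = "year", all.y = TRUE)

# Full grid of year x country, then keep every grid row
all_yr <- expand.grid(year = 2003:2012,
                      country = unique(real_data$country),
                      stringsAsFactors = FALSE)
merged <- merge(all_yr, real_data, by = c("year", "country"), all.x = TRUE)

# Sort for readability; ggplot orders by x within each group anyway
merged <- merged[order(merged$country, merged$year), ]
```

Here by_year has the same number of rows as real_data, while merged has one row per year and country, with NA in variable where no observation exists.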
This worked for me:
set.seed(357)
xy <- data.frame(year = c(2003:2005, 2009:2012, 2005:2010),
country = c(rep("USA", 7), rep("FR", 6)),
vrbl = rnorm(7+6))
sxy <- split(xy, f = xy$country)
mxy <- data.frame(year = 2003:2012)
out <- sapply(sxy, FUN = function(x, mxy) {
out <- merge(x = mxy, y = x, all = TRUE)
out$country <- unique(x$country)
out
}, mxy = mxy, simplify = FALSE)
out <- do.call(rbind, out)
library(ggplot2)
ggplot(out, aes(x = year, y = vrbl, color = country)) +
theme_bw() +
geom_line()
year country vrbl
FR.1 2003 FR NA
FR.2 2004 FR NA
FR.3 2005 FR 0.22703071
FR.4 2006 FR -0.46901506
FR.5 2007 FR 0.47652129
FR.6 2008 FR -0.91164798
FR.7 2009 FR -0.34177516
FR.8 2010 FR 0.54674134
FR.9 2011 FR NA
FR.10 2012 FR NA
USA.1 2003 USA -1.24111731
USA.2 2004 USA -0.58320499
USA.3 2005 USA 0.39474705
USA.4 2006 USA NA
USA.5 2007 USA NA
USA.6 2008 USA NA
USA.7 2009 USA 1.50421107
USA.8 2010 USA 0.76679974
USA.9 2011 USA 0.31746044
USA.10 2012 USA -0.09997594
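Before plotting, it can be worth checking that every group really spans all years and that the NAs sit where expected. A small sketch of such a check follows; the data frame is retyped from the output above with values rounded to one decimal, and rows_per_country and na_years are made-up names.

```r
# Completed data, retyped from the output above (values rounded for brevity)
out <- data.frame(
  year = rep(2003:2012, times = 2),
  country = rep(c("FR", "USA"), each = 10),
  vrbl = c(NA, NA, 0.2, -0.5, 0.5, -0.9, -0.3, 0.5, NA, NA,
           -1.2, -0.6, 0.4, NA, NA, NA, 1.5, 0.8, 0.3, -0.1)
)

# Each country should have one row per year
rows_per_country <- table(out$country)

# Years with no observation, listed per country
missing <- is.na(out$vrbl)
na_years <- split(out$year[missing], out$country[missing])
```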