Insert rows for missing data and interpolate
I have the following data frame in R:
Date Accumulated
1 2016-10-01 6902000
2 2016-11-01 9033000
3 2017-06-01 15033000
4 2017-11-01 24033000
5 2019-05-01 24533000
6 2019-08-01 25033000
7 2019-11-01 27533000
8 2020-06-01 29033000
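I would like to add rows for the months missing from the "Date" column and, at the same time, fill the "Accumulated" column by linear or spline interpolation, preferably spline (that is, I need rows for 2016-12-01, 2017-01-01, 2017-02-01, 2017-03-01, and so on).
I saw another question where people suggested the "zoo" and "data.table" packages: they first create the missing rows with "NA" and then apply the interpolation. However, I am not sure how to do this, because my data is organized differently (all of my dates are in a single column, unlike in that question's example). I am still fairly new to R, and handling the different data types and classes is very difficult for me. I am sure there is a simple way to do this.
Thank you very much.
This approach, which uses splines, may help: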
library(zoo)
#Data
df <- structure(list(Date = structure(c(17075, 17106, 17318, 17471,
18017, 18109, 18201, 18414), class = "Date"), Accumulated = c(6902000L,
9033000L, 15033000L, 24033000L, 24533000L, 25033000L, 27533000L,
29033000L)), row.names = c("1", "2", "3", "4", "5", "6", "7",
"8"), class = "data.frame")
#Create seq of dates
df$Date <- as.Date(df$Date)
dfm <- data.frame(Date=seq(min(df$Date),max(df$Date),by='1 month'))
#Now merge, keeping every month (months without data get NA)
dfmerged <- merge(dfm, df, by = 'Date', all.x = TRUE)
#Now add interpolation
dfmerged$Interpolation <- na.spline(dfmerged$Accumulated)
This produces:
Date Accumulated Interpolation
1 2016-10-01 6902000 6902000
2 2016-11-01 9033000 9033000
3 2016-12-01 NA 10525685
4 2017-01-01 NA 11534406
5 2017-02-01 NA 12222432
6 2017-03-01 NA 12753035
7 2017-04-01 NA 13289484
8 2017-05-01 NA 13995049
9 2017-06-01 15033000 15033000
10 2017-07-01 NA 16511487
11 2017-08-01 NA 18318181
12 2017-09-01 NA 20285631
13 2017-10-01 NA 22246387
14 2017-11-01 24033000 24033000
15 2017-12-01 NA 25510428
16 2018-01-01 NA 26673271
17 2018-02-01 NA 27548534
18 2018-03-01 NA 28163225
19 2018-04-01 NA 28544352
20 2018-05-01 NA 28718923
21 2018-06-01 NA 28713943
22 2018-07-01 NA 28556422
23 2018-08-01 NA 28273365
24 2018-09-01 NA 27891781
25 2018-10-01 NA 27438677
26 2018-11-01 NA 26941060
27 2018-12-01 NA 26425938
28 2019-01-01 NA 25920317
29 2019-02-01 NA 25451205
30 2019-03-01 NA 25045611
31 2019-04-01 NA 24730540
32 2019-05-01 24533000 24533000
33 2019-06-01 NA 24484346
34 2019-07-01 NA 24633317
35 2019-08-01 25033000 25033000
36 2019-09-01 NA 25709290
37 2019-10-01 NA 26579313
38 2019-11-01 27533000 27533000
39 2019-12-01 NA 28465321
40 2020-01-01 NA 29291385
41 2020-02-01 NA 29931341
42 2020-03-01 NA 30305333
43 2020-04-01 NA 30333510
44 2020-05-01 NA 29936017
45 2020-06-01 29033000 29033000
You can try spline from base R, as shown below:
xout <- seq(as.Date("2016-10-01"), as.Date("2020-06-01"), by = "1 month")
yout <- with(df, spline(Date, Accumulated, xout = xout)$y)
setNames(data.frame(xout,yout),names(df))
which gives
> setNames(data.frame(xout,yout),names(df))
Date Accumulated
1 2016-10-01 6902000
2 2016-11-01 9033000
3 2016-12-01 10482841
4 2017-01-01 11503192
5 2017-02-01 12204935
6 2017-03-01 12705371
7 2017-04-01 13267237
8 2017-05-01 13972655
9 2017-06-01 15033000
10 2017-07-01 16485476
11 2017-08-01 18315168
12 2017-09-01 20307491
13 2017-10-01 22227042
14 2017-11-01 24033000
15 2017-12-01 25477768
16 2018-01-01 26651692
17 2018-02-01 27529507
18 2018-03-01 28091508
19 2018-04-01 28484305
20 2018-05-01 28660790
21 2018-06-01 28660401
22 2018-07-01 28509648
23 2018-08-01 28226152
24 2018-09-01 27840967
25 2018-10-01 27398164
26 2018-11-01 26895893
27 2018-12-01 26393045
28 2019-01-01 25883766
29 2019-02-01 25413112
30 2019-03-01 25044851
31 2019-04-01 24726252
32 2019-05-01 24533000
33 2019-06-01 24484235
34 2019-07-01 24629969
35 2019-08-01 25033000
36 2019-09-01 25718441
37 2019-10-01 26569896
38 2019-11-01 27533000
39 2019-12-01 28443968
40 2020-01-01 29277623
41 2020-02-01 29919811
42 2020-03-01 30273784
43 2020-04-01 30309852
44 2020-05-01 29931563
45 2020-06-01 29033000
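A note on why the two outputs above differ slightly (for example 10525685 versus 10482841 for 2016-12-01): calling na.spline on the bare Accumulated vector appears to interpolate against row position, so every month counts as one equal step, whereas spline(Date, ...) uses the actual number of days between dates. The sketch below is an illustration of that difference using base R only. The names dates, idx, by_index, by_date, and comparison are made up for this example and do not come from the answers.

```r
# Known observations from the question
df <- data.frame(
  Date = as.Date(c("2016-10-01", "2016-11-01", "2017-06-01", "2017-11-01",
                   "2019-05-01", "2019-08-01", "2019-11-01", "2020-06-01")),
  Accumulated = c(6902000, 9033000, 15033000, 24033000,
                  24533000, 25033000, 27533000, 29033000)
)

# Full monthly grid covering the observed range
dates <- seq(min(df$Date), max(df$Date), by = "month")

# Row position of each known observation within the monthly grid
idx <- match(df$Date, dates)

# Spline against row position: every month is one equal step
by_index <- spline(idx, df$Accumulated, xout = seq_along(dates))$y

# Spline against calendar time: x is the day count behind each Date
by_date <- spline(as.numeric(df$Date), df$Accumulated,
                  xout = as.numeric(dates))$y

# Side-by-side comparison of the two spacings
comparison <- data.frame(Date = dates, by_index, by_date,
                         difference = by_index - by_date)
head(comparison)
```

Both curves pass through the eight known points. They differ only in the months in between, because calendar months are not all the same length.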
Data
df <- structure(list(Date = structure(c(17075, 17106, 17318, 17471,
18017, 18109, 18201, 18414), class = "Date"), Accumulated = c(6902000L,
9033000L, 15033000L, 24033000L, 24533000L, 25033000L, 27533000L,
29033000L)), row.names = c("1", "2", "3", "4", "5", "6", "7",
"8"), class = "data.frame")
The following base R solution uses approxfun to create an interpolation function.
df1$Date <- as.Date(df1$Date)
f <- approxfun(df1$Date, df1$Accumulated)
d <- seq(min(df1$Date), max(df1$Date), by = "month")
df2 <- data.frame(Date = d, Accumulated = f(d))
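As a quick sanity check on what approxfun computes, the sketch below compares its value for 2016-12-01 with the straight-line formula between the two neighbouring observations (2016-11-01 and 2017-06-01). The data frame is rebuilt inline so the snippet runs on its own, and Date is converted to its numeric day count explicitly. The names target, x0, y0, x1, y1, and manual are illustrative only.

```r
# Known observations from the question
df1 <- data.frame(
  Date = as.Date(c("2016-10-01", "2016-11-01", "2017-06-01", "2017-11-01",
                   "2019-05-01", "2019-08-01", "2019-11-01", "2020-06-01")),
  Accumulated = c(6902000, 9033000, 15033000, 24033000,
                  24533000, 25033000, 27533000, 29033000)
)

# Linear interpolation function over the numeric (day-count) form of Date
f <- approxfun(as.numeric(df1$Date), df1$Accumulated)

target <- as.Date("2016-12-01")
x0 <- df1$Date[2]; y0 <- df1$Accumulated[2]   # observation on 2016-11-01
x1 <- df1$Date[3]; y1 <- df1$Accumulated[3]   # observation on 2017-06-01

# Straight line through the two neighbours, weighted by elapsed days
manual <- y0 + (y1 - y0) * as.numeric(target - x0) / as.numeric(x1 - x0)

# Should be TRUE if approxfun is doing plain linear interpolation
all.equal(f(as.numeric(target)), manual)
```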
To inspect the result, I will plot it with the ggplot2 package.
library(ggplot2)
ggplot(df2, aes(Date, Accumulated)) +
geom_point() +
geom_line() +
geom_point(data = df1, aes(Date, Accumulated), colour = "blue")
Edit
As a follow-up, here is a solution with splinefun.
g <- splinefun(df1$Date, df1$Accumulated)
d <- seq(min(df1$Date), max(df1$Date), by = "month")
df3 <- data.frame(Date = d, Accumulated = g(d))
library(ggplot2)
ggplot(df3, aes(Date, Accumulated)) +
geom_point() +
geom_line() +
geom_point(data = df1, aes(Date, Accumulated), colour = "blue")
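One thing worth noticing in the spline results: Accumulated never decreases in the observed data, yet the default spline climbs to roughly 28.7 million in mid-2018 and then falls back to 24533000 by 2019-05-01. If the column is a running total, that dip may be undesirable. A possible variation, which is not part of the original answers, is a monotone spline through the method = "hyman" option of splinefun, which requires the input values to be monotone. A minimal sketch, with df1 rebuilt inline and x, g_default, g_mono, and df4 as illustrative names:

```r
# Known observations from the question
df1 <- data.frame(
  Date = as.Date(c("2016-10-01", "2016-11-01", "2017-06-01", "2017-11-01",
                   "2019-05-01", "2019-08-01", "2019-11-01", "2020-06-01")),
  Accumulated = c(6902000, 9033000, 15033000, 24033000,
                  24533000, 25033000, 27533000, 29033000)
)

x <- as.numeric(df1$Date)                               # day counts
d <- seq(min(df1$Date), max(df1$Date), by = "month")    # monthly grid

g_default <- splinefun(x, df1$Accumulated)                    # default "fmm" spline
g_mono    <- splinefun(x, df1$Accumulated, method = "hyman")  # monotone spline

df4 <- data.frame(Date = d,
                  Default  = g_default(as.numeric(d)),
                  Monotone = g_mono(as.numeric(d)))

# TRUE if the interpolated series ever decreases from one month to the next
any(diff(df4$Default) < 0)
any(diff(df4$Monotone) < 0)
```

Whether the monotone curve is the better choice depends on what Accumulated represents. It is offered here only as an option to compare against the default spline.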
Data
df1 <- read.table(text = "
Date Accumulated
1 2016-10-01 6902000
2 2016-11-01 9033000
3 2017-06-01 15033000
4 2017-11-01 24033000
5 2019-05-01 24533000
6 2019-08-01 25033000
7 2019-11-01 27533000
8 2020-06-01 29033000
", header = TRUE)