Create "Yesterday's Value" variable for multiple time series
I am working on a project in R and I'm a bit stuck. I have four time series in this format:
x <- data.frame(Id = rep(c(1,2,3,4),2),
Date = c(rep("1980-01-01",4), rep("1980-01-02",4)),
Freq = c(2,3,1,2,4,5,2,3))
Id  Date        Freq
1   1980-01-01  2
2   1980-01-01  3
3   1980-01-01  1
4   1980-01-01  2
1   1980-01-02  4
2   1980-01-02  5
3   1980-01-02  2
4   1980-01-02  3
My goal is to create a new variable that is simply yesterday's Freq value for the same group (Id).
Id  Date        Freq  YestFreq
1   1980-01-01  2     NA
2   1980-01-01  3     NA
3   1980-01-01  1     NA
4   1980-01-01  2     NA
1   1980-01-02  4     2
2   1980-01-02  5     3
3   1980-01-02  2     1
4   1980-01-02  3     2
My attempted solution is:
x$DateID = paste(x$Id, x$Date)
x$yesterday = as.Date(x$Date) - 1
x$YesterdayDateID = paste(x$Id, x$yesterday)
result = numeric(nrow(x))
for(i in 1:nrow(x)){
  # look up the row whose Id/Date key equals this row's Id/yesterday key
  answer = x$Freq[which(x$DateID == x$YesterdayDateID[i])]
  if(length(answer) != 0){result[i] = answer} else{result[i] = NA}
}
x = cbind(x, result)
My actual data set has about 600,000 rows (about 300 IDs and about 2,000 unique dates), so the solution above takes a full 2 hours to run. Any help would be greatly appreciated.
We can try
library(dplyr)
x %>%
arrange(as.Date(Date), Id) %>%
group_by(Id) %>%
mutate(YestFreq = lag(Freq))
# Id Date Freq YestFreq
# (dbl) (fctr) (dbl) (dbl)
#1 1 1980-01-01 2 NA
#2 2 1980-01-01 3 NA
#3 3 1980-01-01 1 NA
#4 4 1980-01-01 2 NA
#5 1 1980-01-02 4 2
#6 2 1980-01-02 5 3
#7 3 1980-01-02 2 1
#8 4 1980-01-02 3 2
For a fast solution, use the data.table package: sort the data, then derive a column per group that takes the Freq value from the previous row:
library(data.table)
x <- data.frame(Id = rep(c(1,2,3,4),2), Date = c(rep("1980-01-01",4), rep("1980-01-02",4)), Freq = c(2,3,1,2,4,5,2,3))
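# The real solution starts here (could even be done in one line):
y <- setDT(x) # convert to data.table (by reference)
setkey(y,Id,Date) # sort the data by Id, then Date
# c(NA, Freq[1:(.N-1)]) shifts Freq down by one row; assumes every Id has at least two rows
y[, .(Date, Freq, YestFreq=c(NA, Freq[1:(.N-1)])), by=.(Id)]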
The output is (in a different order, sorted by Id):
Id Date Freq YestFreq
1: 1 1980-01-01 2 NA
2: 1 1980-01-02 4 2
3: 2 1980-01-01 3 NA
4: 2 1980-01-02 5 3
5: 3 1980-01-01 1 NA
6: 3 1980-01-02 2 1
7: 4 1980-01-01 2 NA
8: 4 1980-01-02 3 2
Edit 1:
You can do it in one line (and sort the result as requested):
library(data.table)
x <- data.frame(Id = rep(c(1,2,3,4),2), Date = c(rep("1980-01-01",4), rep("1980-01-02",4)), Freq = c(2,3,1,2,4,5,2,3))
setDT(x, key=c("Id", "Date"))[, YestFreq := c(NA, Freq[1:(.N-1)]), by=Id][order(Date, Id)]
Result:
Id Date Freq YestFreq
1: 1 1980-01-01 2 NA
2: 2 1980-01-01 3 NA
3: 3 1980-01-01 1 NA
4: 4 1980-01-01 2 NA
5: 1 1980-01-02 4 2
6: 2 1980-01-02 5 3
7: 3 1980-01-02 2 1
8: 4 1980-01-02 3 2
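As a variation on the one-liner above, data.table also provides shift(), which lags a vector directly, so the manual index arithmetic is not needed. The sketch below is one possible form, not the original answer's code. It assumes the same sample data and, like the solutions above, that consecutive rows within an Id are consecutive days.

```r
library(data.table)

# Same sample data as in the question, with Date stored as IDate
x <- data.table(
  Id   = rep(c(1, 2, 3, 4), 2),
  Date = as.IDate(c(rep("1980-01-01", 4), rep("1980-01-02", 4))),
  Freq = c(2, 3, 1, 2, 4, 5, 2, 3)
)

setorder(x, Id, Date)                  # sort by reference so each Id is in date order
x[, YestFreq := shift(Freq), by = Id]  # shift() lags by one row and fills with NA by default
setorder(x, Date, Id)                  # restore the Date, Id order used in the question
x
```

Because shift() returns a vector of the same length as its input, this form also works for an Id that has only one row.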
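To account for possible gaps in the dates (days where yesterday is missing), I use match to identify the previous day, then subset the target column by that index within each Id: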
data.table
library(data.table)
setDT(x)[, Date := as.IDate(Date)][
, YestFreq := Freq[match(Date-1L, Date)], by=Id][]
# Id Date Freq YestFreq
# 1: 1 1980-01-01 2 NA
# 2: 2 1980-01-01 3 NA
# 3: 3 1980-01-01 1 NA
# 4: 4 1980-01-01 2 NA
# 5: 1 1980-01-02 4 2
# 6: 2 1980-01-02 5 3
# 7: 3 1980-01-02 2 1
# 8: 4 1980-01-02 3 2
dplyr
library(dplyr)
x$Date <- as.Date(x$Date)
x %>% group_by(Id) %>% mutate(YestFreq = Freq[match(Date - 1L, Date)])
# Id Date Freq YestFreq
# 1 1 1980-01-01 2 NA
# 2 2 1980-01-01 3 NA
# 3 3 1980-01-01 1 NA
# 4 4 1980-01-01 2 NA
# 5 1 1980-01-02 4 2
# 6 2 1980-01-02 5 3
# 7 3 1980-01-02 2 1
# 8 4 1980-01-02 3 2
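To see why the gap handling matters, here is a small base R sketch on made-up data (the values and the missing days are hypothetical, not from the question). It compares a plain row lag with a date-based lookup. The lookup uses the same paste-key idea as the loop in the question, but vectorized with match().

```r
# Hypothetical data: Id 1 has no row for 1980-01-03, Id 2 has none for 1980-01-02
x <- data.frame(
  Id   = c(1, 1, 1, 2, 2),
  Date = as.Date(c("1980-01-01", "1980-01-02", "1980-01-04",
                   "1980-01-01", "1980-01-03")),
  Freq = c(2, 4, 7, 3, 5)
)

# Row lag within each Id: takes the previous row, whatever its date
x$LagFreq <- ave(x$Freq, x$Id, FUN = function(f) c(NA, head(f, -1)))

# Date-based lookup: match each row's (Id, Date - 1) key against the (Id, Date) keys
key_today     <- paste(x$Id, x$Date)
key_yesterday <- paste(x$Id, x$Date - 1)
x$YestFreq    <- x$Freq[match(key_yesterday, key_today)]

x
# LagFreq  is NA 2 4  NA 3   (rows 3 and 5 pick up a value from two days earlier)
# YestFreq is NA 2 NA NA NA  (rows 3 and 5 are NA because yesterday is missing)
```

Since match() works on the whole column at once, this replaces the row-by-row loop with a single vectorized lookup.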