How to create a new column with values from the previous row of an existing column, in a tibble, in R
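I want to create a new column that holds the last value of the previous period for the same ID, placed on the same row as the first value of the next period. If there is no previous period, the value should be NA.
However, I can't find a function in any package that does this for me, so I expect I will have to write a loop?
Does anyone know how to solve this in a tidy way (with or without a loop) that can be applied to a large tibble (4+ million observations)?
My data is ordered like df below, and the target is df1: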
library(tibble)

df <- tibble(
ID = rep(c(77,88,99),each=6),
PERIOD = rep(c(1,2,3,1,2,3,1,2,3),each=2),
DATE = seq(as.Date("2020-06-01"), as.Date("2020-06-18"), by= "days"),
RESULT = seq(from = 10, to = 44, by = 2)
)
df
# A tibble: 18 x 4
ID PERIOD DATE RESULT
<dbl> <dbl> <date> <dbl>
1 77 1 2020-06-01 10
2 77 1 2020-06-02 12
3 77 2 2020-06-03 14
4 77 2 2020-06-04 16
5 77 3 2020-06-05 18
6 77 3 2020-06-06 20
7 88 1 2020-06-07 22
8 88 1 2020-06-08 24
9 88 2 2020-06-09 26
10 88 2 2020-06-10 28
11 88 3 2020-06-11 30
12 88 3 2020-06-12 32
13 99 1 2020-06-13 34
14 99 1 2020-06-14 36
15 99 2 2020-06-15 38
16 99 2 2020-06-16 40
17 99 3 2020-06-17 42
18 99 3 2020-06-18 44
df1 <- tibble(
ID = rep(c(77,88,99),each=6),
PERIOD = rep(c(1,2,3,1,2,3,1,2,3),each=2),
DATE = seq(as.Date("2020-06-01"), as.Date("2020-06-18"), by= "days"),
RESULT = seq(from = 10, to = 44, by = 2),
RESULT_pre = c("NA","NA",12,"NA",16,"NA","NA","NA",24,"NA",28,
"NA","NA", "NA",36, "NA",40, "NA" )
)
df1
# A tibble: 18 x 5
ID PERIOD DATE RESULT RESULT_pre
<dbl> <dbl> <date> <dbl> <chr>
1 77 1 2020-06-01 10 NA
2 77 1 2020-06-02 12 NA
3 77 2 2020-06-03 14 12
4 77 2 2020-06-04 16 NA
5 77 3 2020-06-05 18 16
6 77 3 2020-06-06 20 NA
7 88 1 2020-06-07 22 NA
8 88 1 2020-06-08 24 NA
9 88 2 2020-06-09 26 24
10 88 2 2020-06-10 28 NA
11 88 3 2020-06-11 30 28
12 88 3 2020-06-12 32 NA
13 99 1 2020-06-13 34 NA
14 99 1 2020-06-14 36 NA
15 99 2 2020-06-15 38 36
16 99 2 2020-06-16 40 NA
17 99 3 2020-06-17 42 40
18 99 3 2020-06-18 44 NA
Grateful for any input.
Thanks / Sofia
Here is one way with dplyr:
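library(dplyr)

df %>%
  group_by(ID, PERIOD) %>%
  # last RESULT of each ID/PERIOD; summarise() drops the PERIOD grouping
  summarise(RESULT_pre = last(RESULT)) %>%
  # still grouped by ID: shift the period-level values down by one period
  mutate(RESULT_pre = lag(RESULT_pre)) %>%
  left_join(df, by = c('ID', 'PERIOD')) %>%
  group_by(ID, PERIOD) %>%
  # keep the value only on the first row of each period
  mutate(RESULT_pre = replace(RESULT_pre, -1, NA)) %>%
  # move RESULT_pre to the last column
  select(-RESULT_pre, RESULT_pre)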
# ID PERIOD DATE RESULT RESULT_pre
# <dbl> <dbl> <date> <dbl> <dbl>
# 1 77 1 2020-06-01 10 NA
# 2 77 1 2020-06-02 12 NA
# 3 77 2 2020-06-03 14 12
# 4 77 2 2020-06-04 16 NA
# 5 77 3 2020-06-05 18 16
# 6 77 3 2020-06-06 20 NA
# 7 88 1 2020-06-07 22 NA
# 8 88 1 2020-06-08 24 NA
# 9 88 2 2020-06-09 26 24
#10 88 2 2020-06-10 28 NA
#11 88 3 2020-06-11 30 28
#12 88 3 2020-06-12 32 NA
#13 99 1 2020-06-13 34 NA
#14 99 1 2020-06-14 36 NA
#15 99 2 2020-06-15 38 36
#16 99 2 2020-06-16 40 NA
#17 99 3 2020-06-17 42 40
#18 99 3 2020-06-18 44 NA
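The logic here is that we summarise the last RESULT value for each ID and PERIOD and use lag to shift the values within each ID. We then join this result back to the original dataset, keep only the first value in each group, and replace all other values with NA.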
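The same logic can probably be expressed without the summarise and join steps. The following is only a sketch, not part of the answer above: it assumes the rows are already sorted by ID, PERIOD and DATE (as in the question), and `new_period` is a helper column introduced here purely for illustration.

```r
library(dplyr)
library(tibble)

# Sample data from the question
df <- tibble(
  ID     = rep(c(77, 88, 99), each = 6),
  PERIOD = rep(c(1, 2, 3, 1, 2, 3, 1, 2, 3), each = 2),
  DATE   = seq(as.Date("2020-06-01"), as.Date("2020-06-18"), by = "days"),
  RESULT = seq(from = 10, to = 44, by = 2)
)

res <- df %>%
  group_by(ID) %>%
  mutate(
    # TRUE only on the first row of a period that follows another period
    # of the same ID; NA on the first row of each ID, because lag() has
    # nothing to look back at there (hypothetical helper column)
    new_period = PERIOD != lag(PERIOD),
    # take the previous row's RESULT where a new period starts, NA elsewhere
    RESULT_pre = if_else(new_period, lag(RESULT), NA_real_)
  ) %>%
  ungroup() %>%
  select(-new_period)

res
```

Because this makes a single grouped pass over the data instead of a summarise followed by a join, it may scale better to several million rows, but that is worth timing on the real data before relying on it.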
You can copy all values shifted down by one row and then overwrite with NA those that do not qualify:
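n <- nrow(df)
# Shift RESULT down by one row, so each row sees the previous row's value
df$RESULT_pre <- c(NA, df$RESULT[-n])
# Blank the shifted value where the previous row has a different ID or the
# same PERIOD, i.e. the row is not the first row of a follow-up period
df$RESULT_pre[c(FALSE, df$ID[-1] != df$ID[-n] |
                df$PERIOD[-1] == df$PERIOD[-n])] <- NA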
df
# ID PERIOD DATE RESULT RESULT_pre
#1 77 1 2020-06-01 10 NA
#2 77 1 2020-06-02 12 NA
#3 77 2 2020-06-03 14 12
#4 77 2 2020-06-04 16 NA
#5 77 3 2020-06-05 18 16
#6 77 3 2020-06-06 20 NA
#7 88 1 2020-06-07 22 NA
#8 88 1 2020-06-08 24 NA
#9 88 2 2020-06-09 26 24
#10 88 2 2020-06-10 28 NA
#11 88 3 2020-06-11 30 28
#12 88 3 2020-06-12 32 NA
#13 99 1 2020-06-13 34 NA
#14 99 1 2020-06-14 36 NA
#15 99 2 2020-06-15 38 36
#16 99 2 2020-06-16 40 NA
#17 99 3 2020-06-17 42 40
#18 99 3 2020-06-18 44 NA
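For reuse, the shift-and-mask idea above could be wrapped in a small function. This is a minimal sketch under stated assumptions: the name `add_result_pre` is made up for this example, the input is assumed to be non-empty and already sorted by ID, PERIOD and DATE, and the condition is written in its positive form (same ID and a new PERIOD) rather than the negated form used above.

```r
# Hypothetical helper: adds RESULT_pre to a data frame that is already
# sorted by ID, PERIOD and DATE
add_result_pre <- function(d) {
  n <- nrow(d)
  # previous row's RESULT for every row (NA for the first row)
  shifted <- c(NA, d$RESULT[-n])
  # TRUE where the previous row has the same ID
  same_id <- c(FALSE, d$ID[-1] == d$ID[-n])
  # TRUE where the previous row belongs to a different PERIOD
  new_period <- c(FALSE, d$PERIOD[-1] != d$PERIOD[-n])
  # keep the shifted value only on the first row of a follow-up period
  d$RESULT_pre <- ifelse(same_id & new_period, shifted, NA)
  d
}

# Sample data from the question, as a plain data frame
df <- data.frame(
  ID     = rep(c(77, 88, 99), each = 6),
  PERIOD = rep(c(1, 2, 3, 1, 2, 3, 1, 2, 3), each = 2),
  DATE   = seq(as.Date("2020-06-01"), as.Date("2020-06-18"), by = "days"),
  RESULT = seq(from = 10, to = 44, by = 2)
)

out <- add_result_pre(df)
out
```

If the rows are not guaranteed to be sorted, ordering them first, for example with `df[order(df$ID, df$PERIOD, df$DATE), ]`, should make the helper safe to apply.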