How to use a cumulative sum function for grouped data with missing observations
The data frame I'm working with looks like this:
DATUM CP SMER TRH MNOZSTVI CENA POPLATKY OBJEM UCET KVARTAL ROK AKTUALNI.MNOZSTVI
<dttm> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 2020-03-03 00:00:00 CEZ K BCPP 50 465. 91.3 -23240 CZK Q1 2020 NA
2 2020-03-04 00:00:00 CEZ K BCPP 50 467. 58.9 -13980 CZK Q1 2020 NA
3 2020-03-12 00:00:00 CEZ P BCPP 30 398 51.8 11940 CZK Q1 2020 NA
4 2020-03-25 00:00:00 KOMERCNI BANKA K BCPP 40 542 85.9 -21680 CZK Q1 2020 NA
5 2020-03-25 00:00:00 MONETA MONEY BANK K BCPP 300 58.4 71.3 -17505 CZK Q1 2020 NA
6 2020-03-30 00:00:00 CEZ K BCPP 10 391 50 -3910 CZK Q1 2020 NA
7 2020-04-02 00:00:00 USD K NA 1000 25.8 0 -25778 CZK Q2 2020 NA
8 2020-04-03 00:00:00 USD K NA 3000 26.1 0 -78392 CZK Q2 2020 NA
9 2020-04-04 00:00:00 USD K NA 1000 26.4 0 -26363. CZK Q2 2020 NA
10 2020-04-06 00:00:00 AVAST K BCPP 150 125. 75.8 -18810 CZK Q2 2020 NA
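I would like to fill the variable AKTUALNI.MNOZSTVI with the cumulative sum of the variable MNOZSTVI, grouped by CP. So the vector AKTUALNI.MNOZSTVI should be c(50,100,130,40,300,140,1000,4000,5000,150, etc.).
The problem is that some values of MNOZSTVI are missing, so I don't know how to use the function cumsum(), which cannot handle missing values. I'm also having trouble applying it to grouped data.
Does anyone know how to do this with cumsum() or another function?
Thanks.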
We can group by 'CP' and take the cumsum of 'MNOZSTVI' inside mutate:
library(dplyr)
# cumulative sum of MNOZSTVI within each CP group
df1 <- df1 %>%
  group_by(CP) %>%
  mutate(AKTUALNI.MNOZSTVI = cumsum(MNOZSTVI))
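Note that plain cumsum() propagates NA: once it meets a missing value, every later value in that group is NA as well. Here is a minimal sketch of one way around that, assuming a missing MNOZSTVI should simply count as zero (an assumption, the question does not say how gaps should be treated). The `toy` data frame is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: the second CEZ row has a missing quantity
toy <- data.frame(
  CP       = c("CEZ", "CEZ", "USD", "CEZ", "USD"),
  MNOZSTVI = c(50, NA, 1000, 30, 3000)
)

res <- toy %>%
  group_by(CP) %>%
  # coalesce() swaps NA for 0 so the running total carries on past the gap
  mutate(AKTUALNI.MNOZSTVI = cumsum(coalesce(MNOZSTVI, 0))) %>%
  ungroup()

res$AKTUALNI.MNOZSTVI
```

With this choice the running total for a row with a missing quantity just repeats the previous total of its group. If the rows are not already in date order, sorting with arrange(DATUM) before grouping is probably wanted, since cumsum() depends on row order.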
Or with base R, using ave:
df1$AKTUALNI.MNOZSTVI <- with(df1, ave(MNOZSTVI, CP, FUN = cumsum))
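The base R version runs into the same NA propagation. A sketch of two possible treatments with ave(), again on made-up data: the helper names `cumsum_na0` and `cumsum_keep_na` and the column `AKTUALNI.KEEP_NA` are hypothetical, and which behaviour is right depends on what a missing quantity means in your data.

```r
# Hypothetical toy data: the second CEZ row has a missing quantity
toy <- data.frame(
  CP       = c("CEZ", "CEZ", "USD", "CEZ", "USD"),
  MNOZSTVI = c(50, NA, 1000, 30, 3000)
)

# Variant 1: treat NA as 0 in the running total
cumsum_na0 <- function(x) cumsum(replace(x, is.na(x), 0))

# Variant 2: same running total, but keep NA where the input was missing
cumsum_keep_na <- function(x) {
  out <- cumsum(replace(x, is.na(x), 0))
  out[is.na(x)] <- NA
  out
}

toy$AKTUALNI.MNOZSTVI <- ave(toy$MNOZSTVI, toy$CP, FUN = cumsum_na0)
toy$AKTUALNI.KEEP_NA  <- ave(toy$MNOZSTVI, toy$CP, FUN = cumsum_keep_na)
toy
```

Variant 2 keeps the gap visible in the result, which may be preferable if a missing quantity signals a data problem rather than a zero-sized trade.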
library(dplyr)
df %>%
  group_by(CP) %>%
  mutate(AKTUALNI.MNOZSTVI = cumsum(MNOZSTVI))
Output:
DATUM CP SMER TRH MNOZSTVI CENA POPLATKY OBJEM UCET KVARTAL ROK.AKTUALNI..MNOZSTVI AKTUALNI.MNOZSTVI
<chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <int>
1 2020-03-03 00:00:00 CEZ K BCPP 50 465 91.3 NA CZK Q1 2020 NA 50
2 2020-03-04 00:00:00 CEZ K BCPP 50 467 58.9 NA CZK Q1 2020 NA 100
3 2020-03-12 00:00:00 CEZ P BCPP 30 398 51.8 11940 CZK Q1 2020 130
4 2020-03-25 00:00:00 KOMERCNI BANKA K BCPP 40 542 85.9 - CZK Q1 2020 40
5 2020-03-25 00:00:00 MONETA MONEY BANK K BCPP 300 58.4 71.3 - CZK Q1 2020 300
6 2020-03-30 00:00:00 CEZ K BCPP 10 391 50 - CZK Q1 2020 140
7 2020-04-02 00:00:00 USD K NA 1000 25.8 0 - CZK Q2 2020 1000
8 2020-04-03 00:00:00 USD K NA 3000 26.1 0 - CZK Q2 2020 4000
9 2020-04-04 00:00:00 USD K NA 1000 26.4 0 - CZK Q2 2020 5000
10 2020-04-06 00:00:00 AVAST K BCPP 150 125 75. 8 - CZK Q2 2020 150
Data:
df <- tibble::tribble(
~DATUM, ~CP, ~SMER, ~TRH, ~MNOZSTVI, ~CENA, ~POPLATKY, ~OBJEM, ~UCET, ~KVARTAL, ~ROK.AKTUALNI..MNOZSTVI,
"2020-03-03", "00:00:00 CEZ", "K", "BCPP", 50L, 465, "91.3", NA, "CZK", "Q1", "2020 NA",
"2020-03-04", "00:00:00 CEZ", "K", "BCPP", 50L, 467, "58.9", NA, "CZK", "Q1", "2020 NA",
"2020-03-12", "00:00:00 CEZ", "P", "BCPP", 30L, 398, "51.8", "11940", "CZK", "Q1", "2020",
"2020-03-25", "00:00:00 KOMERCNI BANKA", "K", "BCPP", 40L, 542, "85.9", "-", "CZK", "Q1", "2020",
"2020-03-25", "00:00:00 MONETA MONEY BANK", "K", "BCPP", 300L, 58.4, "71.3", "-", "CZK", "Q1", "2020",
"2020-03-30", "00:00:00 CEZ", "K", "BCPP", 10L, 391, "50", "-", "CZK", "Q1", "2020",
"2020-04-02", "00:00:00 USD", "K", NA, 1000L, 25.8, "0", "-", "CZK", "Q2", "2020",
"2020-04-03", "00:00:00 USD", "K", NA, 3000L, 26.1, "0", "-", "CZK", "Q2", "2020",
"2020-04-04", "00:00:00 USD", "K", NA, 1000L, 26.4, "0", "-", "CZK", "Q2", "2020",
"2020-04-06", "00:00:00 AVAST", "K", "BCPP", 150L, 125, "75. 8", "-", "CZK", "Q2", "2020"
)
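In the sample data above, the time part of the timestamp ended up at the front of CP ("00:00:00 CEZ"). Grouping still works because the prefix is identical on every row, but if you copy this data you may want to strip it. A small sketch, assuming the prefix is always the literal text "00:00:00 " (the `cp_raw` vector is made up from the values shown above):

```r
# Values as they appear in the CP column of the sample data
cp_raw <- c("00:00:00 CEZ", "00:00:00 KOMERCNI BANKA", "00:00:00 USD")

# Drop the leading time stamp; sub() only touches the first match per element
cp_clean <- sub("^00:00:00 ", "", cp_raw)
cp_clean
```

On the real data frame this would be something like df$CP <- sub("^00:00:00 ", "", df$CP), run before the group_by() step.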