Multiple cumulative sums

Question

Hopefully the title is explicit enough.

I have a table that looks like this:

classes id value
a       1  10
a       2  15
a       3  12
b       1  5
b       2  9
b       3  7
c       1  6
c       2  14
c       3  6

This is what I want:

classes id value cumsum
a       1  10    10
a       2  15    25
a       3  12    37
b       1  5     5
b       2  9     14
b       3  7     21
c       1  6     6
c       2  14    20
c       3  6     26

I have seen this solution, and I have successfully applied it to the case where I do not have multiple classes:

id value cumsum
1  10    10
2  15    25
3  12    37

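It was fairly fast, even with a dataset of roughly the same size as the one I am working on now.

However, when I apply the exact same code to the dataset I am working on now (which looks like the first table in this question, i.e. multiple classes), without subsetting it by a, b, c, it seems to take forever (it has been running for 4 hours now; the dataset has 40,000 rows).

Is there an issue with the code in the linked answer when it is used in this context? I have trouble wrapping my head around the triangular join, but I have the feeling there might be an issue with the size the join takes as the number of rows increases, which slows the whole process down a lot, perhaps even because there are multiple "classes" to do the cumulative sum over.

Is there any way to do this faster? I am using SQL in R through the sqldf package. Solutions in R code (with or without external packages) or in SQL are both fine.

Thanks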

Answer 1

In SQL, you can use the ANSI-standard sum() over () functionality to compute a cumulative sum:

select classes, id, value,
       sum(value) over (partition by classes order by id) as cumesum
from t;
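
Since the question mentions sqldf, here is a minimal sketch of running that kind of query from R. It assumes the sqldf package is installed, that it is using its default SQLite backend, and that the bundled SQLite is version 3.25.0 or newer (the release that added window functions). The data frame name df and the alias cumsum are only for illustration.

```r
library(sqldf)  # assumption: sqldf with its default SQLite backend

# Sample data copied from the question
df <- data.frame(
  classes = rep(c("a", "b", "c"), each = 3),
  id      = rep(1:3, times = 3),
  value   = c(10, 15, 12, 5, 9, 7, 6, 14, 6)
)

# Running total per class, ordered by id within each class.
# The outer ORDER BY only fixes the order of the returned rows.
res <- sqldf("
  select classes, id, value,
         sum(value) over (partition by classes order by id) as cumsum
  from df
  order by classes, id
")
print(res)
```

Because the window is partitioned by classes, each row is only combined with earlier rows of its own class, so no self-join is needed.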

Answer 2

Or you can use by from the base package:

df$cumsum <- unlist(by(df$value, df$classes, cumsum))
#  classes id value cumsum
#1       a  1    10     10
#2       a  2    15     25
#3       a  3    12     37
#4       b  1     5      5
#5       b  2     9     14
#6       b  3     7     21
#7       c  1     6      6
#8       c  2    14     20
#9       c  3     6     26
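
One caveat with the line above: unlist(by(...)) returns the results grouped by class, so the assignment only lines up if the rows of df are already grouped by classes in sorted order, as they are in the question. A possible alternative, sketched here with base R's ave, writes each group's result back to the rows it came from. The data frame below is just the sample from the question.

```r
# Sample data copied from the question
df <- data.frame(
  classes = rep(c("a", "b", "c"), each = 3),
  id      = rep(1:3, times = 3),
  value   = c(10, 15, 12, 5, 9, 7, 6, 14, 6)
)

# Make sure rows are in id order within each class before accumulating
df <- df[order(df$classes, df$id), ]

# ave() applies cumsum within each class and keeps the original row positions;
# FUN must be passed by name because of the ... argument
df$cumsum <- ave(df$value, df$classes, FUN = cumsum)
print(df)
```

Each class is processed once, so the work grows linearly with the number of rows.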
