Cumsum on one column conditional on the number of occurrences of values from another column
The title is really only a very rough idea and may well not match the actual problem.
I have some stock data that looks like this:
"DateTime","Price","Volume","Group"
2020-05-01 13:30:01.354,174.003,750,2020-05-01
2020-05-01 13:30:01.454,174.003,750,2020-05-01
2020-05-01 13:30:01.612,174.592,750,2020-05-01
2020-05-01 13:30:01.663,174.812,750,2020-05-01
2020-05-01 13:30:01.775,174.742,750,2020-05-01
2020-05-01 13:30:02.090,174.742,2000.0001,2020-05-01
2020-05-01 13:30:02.139,174.742,750,2020-05-01
2020-05-01 13:30:02.190,174.743,2000.0001,2020-05-01
2020-05-01 13:30:02.308,174.612,2000.0001,2020-05-01
2020-05-01 13:30:02.428,174.612,750,2020-05-01
2020-05-01 13:30:02.554,174.522,2000.0001,2020-05-01
2020-05-01 13:30:02.656,174.552,750,2020-05-01
2020-05-01 13:30:02.705,174.522,2000.0001,2020-05-01
2020-05-01 13:30:02.913,174.645,750,2020-05-01
2020-05-01 13:30:02.963,175.002,750,2020-05-01
2020-05-01 13:30:03.013,175.002,2000.0001,2020-05-01
2020-05-01 13:30:03.125,175.002,750,2020-05-01
2020-05-01 13:30:03.312,174.803,750,2020-05-01
2020-05-01 13:30:03.362,175.002,2000.0001,2020-05-01
2020-05-01 13:30:03.876,174.772,750,2020-05-01
2020-05-01 13:30:03.927,174.802,2000.0001,2020-05-01
2020-05-01 13:30:04.052,174.802,2000.0001,2020-05-01
2020-05-01 13:30:04.154,174.692,750,2020-05-01
2020-05-01 13:30:04.203,174.802,750,2020-05-01
2020-05-01 13:30:04.255,174.803,2000.0001,2020-05-01
2020-05-01 13:30:04.304,174.803,2000.0001,2020-05-01
2020-05-01 13:30:04.404,174.802,750,2020-05-01
2020-05-01 13:30:04.455,175.003,2000.0001,2020-05-01
2020-05-01 13:30:04.521,174.803,750,2020-05-01
2020-05-01 13:30:04.649,174.802,750,2020-05-01
2020-05-01 13:30:04.771,174.803,2000.0001,2020-05-01
2020-05-01 13:30:04.822,174.803,2000.0001,2020-05-01
2020-05-01 13:30:04.899,174.702,750,2020-05-01
2020-05-01 13:30:04.950,174.802,750,2020-05-01
2020-05-01 13:30:06.498,174.722,750,2020-05-01
2020-05-01 13:30:07.794,174.723,750,2020-05-01
2020-05-01 13:30:07.843,175.003,2000.0001,2020-05-01
2020-05-01 13:30:08.095,175.002,750,2020-05-01
2020-05-01 13:30:08.466,175.002,750,2020-05-01
2020-05-01 13:30:08.567,175.002,750,2020-05-01
2020-05-01 13:30:08.743,174.982,2000.0001,2020-05-01
2020-05-01 13:30:09.123,175.002,750,2020-05-01
2020-05-01 13:30:09.381,174.982,750,2020-05-01
2020-05-01 13:30:09.893,175.002,750,2020-05-01
2020-05-01 13:30:09.942,174.882,750,2020-05-01
2020-05-01 13:30:09.993,174.962,750,2020-05-01
2020-05-01 13:30:11.404,175.002,2000.0001,2020-05-01
2020-05-01 13:30:11.716,174.963,750,2020-05-01
2020-05-01 13:30:11.932,174.963,750,2020-05-01
2020-05-01 13:30:11.983,175.002,750,2020-05-01
2020-05-01 13:30:12.038,174.962,750,2020-05-01
2020-05-01 13:30:12.414,174.963,2000.0001,2020-05-01
2020-05-01 13:30:12.533,174.863,750,2020-05-01
2020-05-01 13:30:12.585,174.962,2000.0001,2020-05-01
2020-05-01 13:30:13.763,175.002,750,2020-05-01
2020-05-01 13:30:14.473,174.962,750,2020-05-01
2020-05-01 13:30:16.157,174.962,750,2020-05-01
2020-05-01 13:30:16.207,175.002,2000.0001,2020-05-01
2020-05-01 13:30:16.268,175.002,750,2020-05-01
2020-05-01 13:30:18.455,175.002,750,2020-05-01
2020-05-01 13:30:18.506,175.322,750,2020-05-01
2020-05-01 13:30:19.289,175.322,750,2020-05-01
2020-05-01 13:30:19.340,175.342,750,2020-05-01
2020-05-01 13:30:19.953,175.343,750,2020-05-01
2020-05-01 13:30:20.761,175.362,2000.0001,2020-05-01
2020-05-01 13:30:21.588,175.363,750,2020-05-01
2020-05-01 13:30:21.638,175.382,750,2020-05-01
2020-05-01 13:30:22.387,175.383,750,2020-05-01
2020-05-01 13:30:22.486,175.442,750,2020-05-01
2020-05-01 13:30:22.580,175.382,750,2020-05-01
2020-05-01 13:30:23.595,175.442,750,2020-05-01
2020-05-01 13:30:23.645,175.383,750,2020-05-01
2020-05-01 13:30:23.762,175.442,750,2020-05-01
2020-05-01 13:30:24.085,175.382,750,2020-05-01
2020-05-01 13:30:24.134,175.273,2000.0001,2020-05-01
2020-05-01 13:30:24.608,175.272,750,2020-05-01
2020-05-01 13:30:24.658,175.272,750,2020-05-01
2020-05-01 13:30:25.019,175.272,750,2020-05-01
2020-05-01 13:30:25.070,175.332,750,2020-05-01
2020-05-01 13:30:25.238,175.283,750,2020-05-01
2020-05-01 13:30:25.289,175.282,2000.0001,2020-05-01
2020-05-01 13:30:25.749,175.273,750,2020-05-01
2020-05-01 13:30:25.799,175.273,2000.0001,2020-05-01
2020-05-01 13:30:25.863,175.273,750,2020-05-01
2020-05-01 13:30:25.914,175.333,2000.0001,2020-05-01
2020-05-01 13:30:26.073,175.283,750,2020-05-01
2020-05-01 13:30:26.124,175.282,2000.0001,2020-05-01
2020-05-01 13:30:26.187,175.203,750,2020-05-01
2020-05-01 13:30:26.237,175.182,2000.0001,2020-05-01
2020-05-01 13:30:26.710,175.282,2000.0001,2020-05-01
2020-05-01 13:30:27.511,175.282,2000.0001,2020-05-01
2020-05-01 13:30:27.763,175.332,2000.0001,2020-05-01
2020-05-01 13:30:28.187,175.233,750,2020-05-01
2020-05-01 13:30:28.236,175.232,750,2020-05-01
2020-05-01 13:30:28.302,175.232,750,2020-05-01
2020-05-01 13:30:28.353,175.232,2000.0001,2020-05-01
2020-05-01 13:30:28.457,175.152,750,2020-05-01
2020-05-01 13:30:28.507,175.152,750,2020-05-01
2020-05-01 13:30:28.601,175.153,2000.0001,2020-05-01
2020-05-01 13:30:28.894,175.093,750,2020-05-01
2020-05-01 13:30:28.945,175.092,750,2020-05-01
2020-05-01 13:30:29.049,175.093,2000.0001,2020-05-01
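For reference, this sample can be read into R with something like the following sketch (the file name trades.csv is just a placeholder for wherever the sample is saved):

library(data.table)
# fread() keeps the file's natural row order
trades <- fread("trades.csv")
# optional: parse the timestamp including the millisecond part
trades[, DateTime := as.POSIXct(DateTime, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")]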
What I would like to do is compute a cumsum of Volume based on the sequential frequency of the values in the Price column. Below is the csv above read into R as a data.frame/data.table:
DateTime Price Volume Group
1: 2020-05-01 13:30:01.354 174.003 750 2020-05-01
2: 2020-05-01 13:30:01.454 174.003 750 2020-05-01
3: 2020-05-01 13:30:01.612 174.592 750 2020-05-01
4: 2020-05-01 13:30:01.663 174.812 750 2020-05-01
5: 2020-05-01 13:30:01.775 174.742 750 2020-05-01
6: 2020-05-01 13:30:02.090 174.742 2000 2020-05-01
7: 2020-05-01 13:30:02.139 174.742 750 2020-05-01
8: 2020-05-01 13:30:02.190 174.743 2000 2020-05-01
9: 2020-05-01 13:30:02.308 174.612 2000 2020-05-01
10: 2020-05-01 13:30:02.428 174.612 750 2020-05-01
11: 2020-05-01 13:30:02.554 174.522 2000 2020-05-01
12: 2020-05-01 13:30:02.656 174.552 750 2020-05-01
13: 2020-05-01 13:30:02.705 174.522 2000 2020-05-01
To explain in more detail: I would take the first 2 rows and sum their corresponding volumes into one element of the output data object (preferably a data.frame); rows 3 and 4 each have a price frequency of 1 (they occur only once in the price sequence), so their volumes need no summing and they produce 2 rows in the output data object. Rows 5, 6 and 7 all have the same price, 174.742, so again I sum the 3 volumes and produce 1 row in the output data object. The same logic applies to the rest of the data.
I have been experimenting with dplyr but to no avail; the best I could get was index groups per price occurrence, but without preserving the natural order. I cannot even think of a way to get close to the result I want (in a vectorized style; I am trying to avoid plain loops and writing too much logic).
As suggested by @RonakShah, I am adding this example of the expected output. Sorry for not adding one earlier; I really was not sure what it should look like. Having thought about it again, I believe I only need the cumsum output over the volume sequence, as a single column. I do not really need another column for indexing (although obviously any additional column would be a very-nice-to-have).
Sample output:
Cumsum.Volume
1: 1500 # aggregated from row 1,2 as they have same price
2: 750 # no aggregation as row 3 has occurred only once
3: 750 # same as above
4: 3500 # aggregation from row 4,5,6 as they have the same price
5: 2000 # row 7 no aggregation for unique price occurrence in sequence
6: 2750 # row 8,9 maps to same price so add them up
7: 2000 # row 10 needs no aggregation
8: 2000 # row 11 needs no aggregation
9: 750 # row 12 needs no aggregation
Here is the mapping between price and volume:
`174.003` appeared twice together, so we cumsum their volumes
`174.592` appeared by itself, so we keep its volume
`174.812` appeared by itself, so we keep its volume
`174.742` appeared 3 times consecutively, so we cumsum/aggregate their volumes
`174.743` appeared by itself, so we keep its volume
`174.612` appeared twice together, so we cumsum their volumes
`174.522` appeared once by itself, we keep its volume
`174.552` appeared by itself, we keep its volume
`174.522` appeared again by itself, we keep its volume
Hopefully the additions above give a better idea; the order of the volume cumsum follows the natural order of the price column, but I do not need to keep the price column.
Any hints in the right direction?
As Ronak Shah says, you probably just want to group by price and then sum the volumes. However, you need to be careful: if you simply group by price, you will inadvertently group some later rows together with earlier ones (for example, if the price rises and then comes back to the same level). I am guessing you do not want that. Instead, you should group on adjacent rows that share the same price. You can do that like this:
library(dplyr)

df %>%
  # new group id each time the price changes between consecutive rows
  mutate(PriceChange = cumsum(c(TRUE, diff(Price) != 0))) %>%
  group_by(PriceChange) %>%
  # running total of volume within each run of equal prices
  mutate(CumSumVolume = cumsum(Volume)) %>%
  ungroup() %>%
  select(-PriceChange) %>%
  as.data.frame()
The first few rows of the output look like this:
#> DateTime Price Volume Group CumSumVolume
#> 1 2020-05-01 13:30:01.354 174.003 750 2020-05-01 750
#> 2 2020-05-01 13:30:01.454 174.003 750 2020-05-01 1500
#> 3 2020-05-01 13:30:01.612 174.592 750 2020-05-01 750
#> 4 2020-05-01 13:30:01.663 174.812 750 2020-05-01 750
#> 5 2020-05-01 13:30:01.775 174.742 750 2020-05-01 750
#> 6 2020-05-01 13:30:02.090 174.742 2000 2020-05-01 2750
#> 7 2020-05-01 13:30:02.139 174.742 750 2020-05-01 3500
#> 8 2020-05-01 13:30:02.190 174.743 2000 2020-05-01 2000
#> 9 2020-05-01 13:30:02.308 174.612 2000 2020-05-01 2000
#> 10 2020-05-01 13:30:02.428 174.612 750 2020-05-01 2750
#> 11 2020-05-01 13:30:02.554 174.522 2000 2020-05-01 2000
#> 12 2020-05-01 13:30:02.656 174.552 750 2020-05-01 750
#> 13 2020-05-01 13:30:02.705 174.522 2000 2020-05-01 2000
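If you prefer data.table (which the question mentions), a roughly equivalent sketch builds the same run id with rleid(), assuming df holds the same data:

library(data.table)
setDT(df)
# rleid() assigns a new id whenever Price changes, i.e. one id per consecutive run
df[, CumSumVolume := cumsum(Volume), by = rleid(Price)]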
Group by the date part of DateTime and by Price:
library(tidyverse)
library(lubridate)  # for ymd_hms()/as_date(); attach explicitly in case your tidyverse version does not load it

data <- tibble(
  DateTime = ymd_hms(c("2020-05-01 13:30:01.354", "2020-05-01 13:30:01.454", "2020-05-01 13:30:01.612",
                       "2020-05-01 13:30:01.663", "2020-05-01 13:30:01.775", "2020-05-01 13:30:02.090",
                       "2020-05-01 13:30:02.139", "2020-05-01 13:30:02.190", "2020-05-01 13:30:02.308",
                       "2020-05-01 13:30:02.428")),
  Price = c(174.003, 174.003, 174.592, 174.812, 174.742, 174.742, 174.742, 174.743, 174.612, 174.612),
  Volume = c(750, 750, 750, 750, 750, 2000, 750, 2000, 2000, 750))

groupedData <- data %>%
  mutate(Date = lubridate::as_date(DateTime)) %>%
  group_by(Date, Price) %>%
  summarise(Volume = sum(Volume)) %>%
  ungroup()

groupedData
which gives
# A tibble: 6 x 3
Date Price Volume
<date> <dbl> <dbl>
1 2020-05-01 174. 1500
2 2020-05-01 175. 750
3 2020-05-01 175. 2750
4 2020-05-01 175. 3500
5 2020-05-01 175. 2000
6 2020-05-01 175. 750
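Note that grouping by Price alone merges every occurrence of a price across the whole day. If you instead want one row per consecutive run of equal prices (as in the question's expected output), a sketch using a run id on the same data tibble could look like this:

groupedRuns <- data %>%
  mutate(Run = cumsum(c(TRUE, diff(Price) != 0))) %>%  # new id whenever Price changes
  group_by(Run) %>%
  summarise(Price = first(Price), Volume = sum(Volume), .groups = "drop")
groupedRuns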