Cumsum on one column conditional on the number of occurrences of values from another column

The title is really just a very rough idea and most likely doesn't describe the actual problem well.

I have some stock data that looks like this:

"DateTime","Price","Volume","Group"
2020-05-01 13:30:01.354,174.003,750,2020-05-01
2020-05-01 13:30:01.454,174.003,750,2020-05-01
2020-05-01 13:30:01.612,174.592,750,2020-05-01
2020-05-01 13:30:01.663,174.812,750,2020-05-01
2020-05-01 13:30:01.775,174.742,750,2020-05-01
2020-05-01 13:30:02.090,174.742,2000.0001,2020-05-01
2020-05-01 13:30:02.139,174.742,750,2020-05-01
2020-05-01 13:30:02.190,174.743,2000.0001,2020-05-01
2020-05-01 13:30:02.308,174.612,2000.0001,2020-05-01
2020-05-01 13:30:02.428,174.612,750,2020-05-01
2020-05-01 13:30:02.554,174.522,2000.0001,2020-05-01
2020-05-01 13:30:02.656,174.552,750,2020-05-01
2020-05-01 13:30:02.705,174.522,2000.0001,2020-05-01
2020-05-01 13:30:02.913,174.645,750,2020-05-01
2020-05-01 13:30:02.963,175.002,750,2020-05-01
2020-05-01 13:30:03.013,175.002,2000.0001,2020-05-01
2020-05-01 13:30:03.125,175.002,750,2020-05-01
2020-05-01 13:30:03.312,174.803,750,2020-05-01
2020-05-01 13:30:03.362,175.002,2000.0001,2020-05-01
2020-05-01 13:30:03.876,174.772,750,2020-05-01
2020-05-01 13:30:03.927,174.802,2000.0001,2020-05-01
2020-05-01 13:30:04.052,174.802,2000.0001,2020-05-01
2020-05-01 13:30:04.154,174.692,750,2020-05-01
2020-05-01 13:30:04.203,174.802,750,2020-05-01
2020-05-01 13:30:04.255,174.803,2000.0001,2020-05-01
2020-05-01 13:30:04.304,174.803,2000.0001,2020-05-01
2020-05-01 13:30:04.404,174.802,750,2020-05-01
2020-05-01 13:30:04.455,175.003,2000.0001,2020-05-01
2020-05-01 13:30:04.521,174.803,750,2020-05-01
2020-05-01 13:30:04.649,174.802,750,2020-05-01
2020-05-01 13:30:04.771,174.803,2000.0001,2020-05-01
2020-05-01 13:30:04.822,174.803,2000.0001,2020-05-01
2020-05-01 13:30:04.899,174.702,750,2020-05-01
2020-05-01 13:30:04.950,174.802,750,2020-05-01
2020-05-01 13:30:06.498,174.722,750,2020-05-01
2020-05-01 13:30:07.794,174.723,750,2020-05-01
2020-05-01 13:30:07.843,175.003,2000.0001,2020-05-01
2020-05-01 13:30:08.095,175.002,750,2020-05-01
2020-05-01 13:30:08.466,175.002,750,2020-05-01
2020-05-01 13:30:08.567,175.002,750,2020-05-01
2020-05-01 13:30:08.743,174.982,2000.0001,2020-05-01
2020-05-01 13:30:09.123,175.002,750,2020-05-01
2020-05-01 13:30:09.381,174.982,750,2020-05-01
2020-05-01 13:30:09.893,175.002,750,2020-05-01
2020-05-01 13:30:09.942,174.882,750,2020-05-01
2020-05-01 13:30:09.993,174.962,750,2020-05-01
2020-05-01 13:30:11.404,175.002,2000.0001,2020-05-01
2020-05-01 13:30:11.716,174.963,750,2020-05-01
2020-05-01 13:30:11.932,174.963,750,2020-05-01
2020-05-01 13:30:11.983,175.002,750,2020-05-01
2020-05-01 13:30:12.038,174.962,750,2020-05-01
2020-05-01 13:30:12.414,174.963,2000.0001,2020-05-01
2020-05-01 13:30:12.533,174.863,750,2020-05-01
2020-05-01 13:30:12.585,174.962,2000.0001,2020-05-01
2020-05-01 13:30:13.763,175.002,750,2020-05-01
2020-05-01 13:30:14.473,174.962,750,2020-05-01
2020-05-01 13:30:16.157,174.962,750,2020-05-01
2020-05-01 13:30:16.207,175.002,2000.0001,2020-05-01
2020-05-01 13:30:16.268,175.002,750,2020-05-01
2020-05-01 13:30:18.455,175.002,750,2020-05-01
2020-05-01 13:30:18.506,175.322,750,2020-05-01
2020-05-01 13:30:19.289,175.322,750,2020-05-01
2020-05-01 13:30:19.340,175.342,750,2020-05-01
2020-05-01 13:30:19.953,175.343,750,2020-05-01
2020-05-01 13:30:20.761,175.362,2000.0001,2020-05-01
2020-05-01 13:30:21.588,175.363,750,2020-05-01
2020-05-01 13:30:21.638,175.382,750,2020-05-01
2020-05-01 13:30:22.387,175.383,750,2020-05-01
2020-05-01 13:30:22.486,175.442,750,2020-05-01
2020-05-01 13:30:22.580,175.382,750,2020-05-01
2020-05-01 13:30:23.595,175.442,750,2020-05-01
2020-05-01 13:30:23.645,175.383,750,2020-05-01
2020-05-01 13:30:23.762,175.442,750,2020-05-01
2020-05-01 13:30:24.085,175.382,750,2020-05-01
2020-05-01 13:30:24.134,175.273,2000.0001,2020-05-01
2020-05-01 13:30:24.608,175.272,750,2020-05-01
2020-05-01 13:30:24.658,175.272,750,2020-05-01
2020-05-01 13:30:25.019,175.272,750,2020-05-01
2020-05-01 13:30:25.070,175.332,750,2020-05-01
2020-05-01 13:30:25.238,175.283,750,2020-05-01
2020-05-01 13:30:25.289,175.282,2000.0001,2020-05-01
2020-05-01 13:30:25.749,175.273,750,2020-05-01
2020-05-01 13:30:25.799,175.273,2000.0001,2020-05-01
2020-05-01 13:30:25.863,175.273,750,2020-05-01
2020-05-01 13:30:25.914,175.333,2000.0001,2020-05-01
2020-05-01 13:30:26.073,175.283,750,2020-05-01
2020-05-01 13:30:26.124,175.282,2000.0001,2020-05-01
2020-05-01 13:30:26.187,175.203,750,2020-05-01
2020-05-01 13:30:26.237,175.182,2000.0001,2020-05-01
2020-05-01 13:30:26.710,175.282,2000.0001,2020-05-01
2020-05-01 13:30:27.511,175.282,2000.0001,2020-05-01
2020-05-01 13:30:27.763,175.332,2000.0001,2020-05-01
2020-05-01 13:30:28.187,175.233,750,2020-05-01
2020-05-01 13:30:28.236,175.232,750,2020-05-01
2020-05-01 13:30:28.302,175.232,750,2020-05-01
2020-05-01 13:30:28.353,175.232,2000.0001,2020-05-01
2020-05-01 13:30:28.457,175.152,750,2020-05-01
2020-05-01 13:30:28.507,175.152,750,2020-05-01
2020-05-01 13:30:28.601,175.153,2000.0001,2020-05-01
2020-05-01 13:30:28.894,175.093,750,2020-05-01
2020-05-01 13:30:28.945,175.092,750,2020-05-01
2020-05-01 13:30:29.049,175.093,2000.0001,2020-05-01

What I want to do is compute a cumsum of Volume based on the sequential frequency of the values in the Price column. Below is the CSV above read into R as a data.frame/data.table:

                  DateTime   Price Volume      Group
 1: 2020-05-01 13:30:01.354 174.003    750 2020-05-01
 2: 2020-05-01 13:30:01.454 174.003    750 2020-05-01
 3: 2020-05-01 13:30:01.612 174.592    750 2020-05-01
 4: 2020-05-01 13:30:01.663 174.812    750 2020-05-01
 5: 2020-05-01 13:30:01.775 174.742    750 2020-05-01
 6: 2020-05-01 13:30:02.090 174.742   2000 2020-05-01
 7: 2020-05-01 13:30:02.139 174.742    750 2020-05-01
 8: 2020-05-01 13:30:02.190 174.743   2000 2020-05-01
 9: 2020-05-01 13:30:02.308 174.612   2000 2020-05-01
10: 2020-05-01 13:30:02.428 174.612    750 2020-05-01
11: 2020-05-01 13:30:02.554 174.522   2000 2020-05-01
12: 2020-05-01 13:30:02.656 174.552   2000 2020-05-01
13: 2020-05-01 13:30:02.705 174.522    750 2020-05-01

To explain in more detail: I would take the first 2 rows and sum their volumes into one element of the output data object (preferably a data.frame); rows 3 and 4 each have a price frequency of 1 (a unique occurrence in the price sequence), so their volumes don't need to be summed and they produce 2 rows in the output object. Rows 5, 6 and 7 share the same price, 174.742, so again I sum the 3 volumes and produce 1 row in the output object. The same logic applies to the rest of the data.

I have been experimenting with dplyr to no avail; the best I could get was index groups for each price occurrence, but without preserving the natural order. I can't even think of a way to get close to the result I want (in a vectorised style; I am trying to avoid plain loops and writing too much logic).

Following @RonakShah's suggestion, I am adding this expected-output example. Sorry I didn't add one earlier; I really wasn't sure what it should look like. Having thought about it again, I think all I need as output is the cumsum of the volume sequence, as a single column. I don't really need another column for indexing (although obviously any additional column would be a very-nice-to-have).

Example output:

         Cumsum.Volume
    1:       1500  # aggregated from rows 1,2 as they have the same price
    2:       750   # no aggregation, as row 3 occurs only once
    3:       750   # same as above (row 4)
    4:       3500  # aggregated from rows 5,6,7 as they have the same price
    5:       2000  # row 8 needs no aggregation (unique price occurrence in the sequence)
    6:       2750  # rows 9,10 map to the same price, so add them up
    7:       2000  # row 11 needs no aggregation
    8:       2000  # row 12 needs no aggregation
    9:       750   # row 13 needs no aggregation

Here is the mapping between prices and volumes:

`174.003` appeared twice together, so we cumsum their volumes
`174.592` appeared by itself, so we keep its volume
`174.812` appeared by itself, so we keep its volume
`174.742` appeared 3 times consecutively, so we cumsum/aggregate their volumes
`174.743` appeared by itself, so we keep its volume
`174.612` appeared twice together, so we cumsum their volumes
`174.522` appeared by itself, so we keep its volume
`174.552` appeared by itself, so we keep its volume
`174.522` appeared again by itself, so we keep its volume

Hopefully the additions above give a better idea: the cumsum of the volumes follows the natural order of the Price column, but I don't need to keep the Price column itself.

Any hints in the right direction?

As Ronak Shah says, you may just want to group by price and then sum the volumes. However, you need to be careful: if you simply group by Price, you will inadvertently group some later rows together with earlier ones (for example, if the price goes up and then comes back to the same level). I'm guessing you don't want that, so you should instead group adjacent rows that have the same price. You can do it like this:

library(dplyr)

df %>% 
  # start a new run id every time Price differs from the previous row
  mutate(PriceChange = cumsum(c(TRUE, diff(Price) != 0))) %>%
  group_by(PriceChange) %>%
  # running total of Volume within each run of identical prices
  mutate(CumSumVolume = cumsum(Volume)) %>% 
  ungroup() %>% 
  select(-PriceChange) %>%
  as.data.frame()

The first few rows of the output look like this:

#>                    DateTime   Price Volume      Group CumSumVolume
#> 1   2020-05-01 13:30:01.354 174.003    750 2020-05-01          750
#> 2   2020-05-01 13:30:01.454 174.003    750 2020-05-01         1500
#> 3   2020-05-01 13:30:01.612 174.592    750 2020-05-01          750
#> 4   2020-05-01 13:30:01.663 174.812    750 2020-05-01          750
#> 5   2020-05-01 13:30:01.775 174.742    750 2020-05-01          750
#> 6   2020-05-01 13:30:02.090 174.742   2000 2020-05-01         2750
#> 7   2020-05-01 13:30:02.139 174.742    750 2020-05-01         3500
#> 8   2020-05-01 13:30:02.190 174.743   2000 2020-05-01         2000
#> 9   2020-05-01 13:30:02.308 174.612   2000 2020-05-01         2000
#> 10  2020-05-01 13:30:02.428 174.612    750 2020-05-01         2750
#> 11  2020-05-01 13:30:02.554 174.522   2000 2020-05-01         2000
#> 12  2020-05-01 13:30:02.656 174.552    750 2020-05-01          750
#> 13  2020-05-01 13:30:02.705 174.522   2000 2020-05-01         2000
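
If what you actually want is one aggregated row per run (as in your expected output) rather than a running total on every row, a small variation of the same idea should do it. This is just a sketch: the helper column PriceChange is the same as above, and the output column name Cumsum.Volume is only illustrative.

library(dplyr)

df %>%
  mutate(PriceChange = cumsum(c(TRUE, diff(Price) != 0))) %>%   # run id for consecutive identical prices
  group_by(PriceChange) %>%
  summarise(Price = first(Price),              # price of the run (optional, can be dropped)
            Cumsum.Volume = sum(Volume)) %>%   # one aggregated volume per run
  ungroup() %>%
  select(-PriceChange)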

Group by the date part of DateTime and Price:

library(tidyverse)
library(lubridate)  # for ymd_hms() and as_date()

data <- tibble(
  DateTime = ymd_hms(c("2020-05-01 13:30:01.354", "2020-05-01 13:30:01.454", "2020-05-01 13:30:01.612",
                       "2020-05-01 13:30:01.663", "2020-05-01 13:30:01.775", "2020-05-01 13:30:02.090",
                       "2020-05-01 13:30:02.139", "2020-05-01 13:30:02.190", "2020-05-01 13:30:02.308",
                       "2020-05-01 13:30:02.428")),
  Price = c(174.003, 174.003, 174.592, 174.812, 174.742, 174.742, 174.742, 174.743, 174.612, 174.612),
  Volume = c(750, 750, 750, 750, 750, 2000, 750, 2000, 2000, 750))

groupedData <- data %>%
  mutate(Date = lubridate::as_date(DateTime)) %>%  # extract the date part of DateTime
  group_by(Date, Price) %>%                        # one group per date/price combination
  summarise(Volume = sum(Volume)) %>%              # total volume per group
  ungroup()

groupedData

which gives

# A tibble: 6 x 3
  Date       Price Volume
  <date>     <dbl>  <dbl>
1 2020-05-01  174.   1500
2 2020-05-01  175.    750
3 2020-05-01  175.   2750
4 2020-05-01  175.   3500
5 2020-05-01  175.   2000
6 2020-05-01  175.    750
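
Note that grouping on Price alone merges every occurrence of a price within the day, not just consecutive ones (for example, the two separated 174.522 rows in the sample would end up in one group). If only adjacent rows with the same price should be combined, a possible sketch is to build a run id first, for example with data.table's rleid(); the names dt, Run and runTotals below are just illustrative, and data is the tibble built above.

library(data.table)

dt <- as.data.table(data)        # convert the tibble to a data.table
dt[, Run := rleid(Price)]        # new run id whenever Price differs from the previous row
runTotals <- dt[, .(Price = Price[1L], Cumsum.Volume = sum(Volume)), by = Run]
runTotals[, Run := NULL]         # drop the helper column; one row per consecutive price run
runTotals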