How does Spark determine window.start for the first window when grouping by a time window?
Here is a sample of the data:
scala> purchases.show(false)
+---------+-------------------+--------+
|client_id|transaction_ts |store_id|
+---------+-------------------+--------+
|1 |2018-06-01 12:17:37|1 |
|1 |2018-06-02 13:17:37|2 |
|1 |2018-06-03 14:17:37|3 |
|1 |2018-06-09 10:17:37|2 |
|2 |2018-06-02 10:17:37|1 |
|2 |2018-06-02 13:17:37|2 |
|2 |2018-06-08 14:19:37|3 |
|2 |2018-06-16 13:17:37|2 |
|2 |2018-06-17 14:17:37|3 |
+---------+-------------------+--------+
When I group by a time window:
scala> purchases.groupBy($"client_id", window($"transaction_ts", "8 days")).count.orderBy("client_id", "window.start").show(false)
+---------+---------------------------------------------+-----+
|client_id|window |count|
+---------+---------------------------------------------+-----+
|1 |[2018-05-28 17:00:00.0,2018-06-05 17:00:00.0]|3 |
|1 |[2018-06-05 17:00:00.0,2018-06-13 17:00:00.0]|1 |
|2 |[2018-05-28 17:00:00.0,2018-06-05 17:00:00.0]|2 |
|2 |[2018-06-05 17:00:00.0,2018-06-13 17:00:00.0]|1 |
|2 |[2018-06-13 17:00:00.0,2018-06-21 17:00:00.0]|2 |
+---------+---------------------------------------------+-----+
I would like to know why the first window.start is 2018-05-28 17:00:00.0 when the minimum timestamp in the data is 2018-06-01 12:17:37. How does Spark compute time windows? I expected the minimum timestamp to be used as the earliest window.start.
...
Thanks @user8371915! Following the suggested link, I found the answer I was looking for: window.start does not depend on my data at all; Spark generates windows starting from 1970-01-01. For more details, see What does the 'pyspark.sql.functions.window' function's 'startTime' argument do?
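
To see where the boundary comes from, here is a minimal sketch (my own illustration for a fixed, non-sliding window with no startTime offset, not Spark's actual source) of deriving a window start by bucketing the timestamp relative to the Unix epoch (1970-01-01 00:00:00 UTC):

import java.time.Instant

val windowSeconds = 8L * 24 * 60 * 60              // "8 days"
val ts = Instant.parse("2018-06-01T12:17:37Z")     // earliest transaction_ts, read as UTC for the sketch
val bucket = ts.getEpochSecond / windowSeconds     // whole 8-day buckets elapsed since the epoch
val windowStart = Instant.ofEpochSecond(bucket * windowSeconds)
println(windowStart)                               // 2018-05-29T00:00:00Z

2018-05-29 00:00:00 UTC rendered in a UTC-7 session time zone (which the output above appears to use) is exactly 2018-05-28 17:00:00, matching the first window.start. If you need the windows anchored differently, the fourth argument of window, startTime, shifts this origin, e.g. window($"transaction_ts", "8 days", "8 days", "12 hours").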