Is it safe to use a WindowSpec with an undefined frame in Spark?
I often use window functions in Apache Spark, for example to compute a running sum. So far I have never specified a frame, because the output has always been correct. But recently I read in a blog post (https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html):
In addition to the ordering and partitioning, users need to define the
start boundary of the frame, the end boundary of the frame, and the
type of the frame, which are three components of a frame
specification.
So I am wondering whether it is safe to use an unspecified frame, for example:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._ // for toDF and $; already in scope in spark-shell

val df = (1 to 10000).toDF("i")
df
  .select(
    $"i",
    sum($"i").over(Window.orderBy($"i")).as("running_sum1"), // unspecified frame
    sum($"i").over(Window.orderBy($"i").rowsBetween(Window.unboundedPreceding, Window.currentRow)).as("running_sum2") // specified frame
  )
  .show()
+---+------------+------------+
| i|running_sum1|running_sum2|
+---+------------+------------+
| 1| 1| 1|
| 2| 3| 3|
| 3| 6| 6|
| 4| 10| 10|
| 5| 15| 15|
| 6| 21| 21|
| 7| 28| 28|
| 8| 36| 36|
| 9| 45| 45|
| 10| 55| 55|
| 11| 66| 66|
| 12| 78| 78|
| 13| 91| 91|
| 14| 105| 105|
| 15| 120| 120|
| 16| 136| 136|
| 17| 153| 153|
| 18| 171| 171|
| 19| 190| 190|
| 20| 210| 210|
+---+------------+------------+
Apparently they produce the same output, but are there situations in which using an unspecified frame is dangerous? I am using Spark 2.x, by the way.
Yes, it is safe.
Looking at the source code of the Window object on the master branch, there is the following comment (not present in the 2.3.0 branch):
When ordering is not defined, an unbounded window frame (rowFrame, unboundedPreceding, unboundedFollowing) is used by default. When ordering is defined, a growing window frame (rangeFrame, unboundedPreceding, currentRow) is used by default.
In other words, when the window has an ordering, i.e. when orderBy is used, leaving the frame boundaries unspecified is equivalent to specifying (note it is a range frame, not a rows frame, as the comment above says):

rangeBetween(Window.unboundedPreceding, Window.currentRow)
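The distinction between this default range frame and an explicit rows frame only surfaces when the ordering column contains ties, because a range frame includes every row with the same ordering value. A minimal sketch with hypothetical data:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._ // spark is the active SparkSession

// Duplicated ordering value: two rows with i = 2.
val dups = Seq(1, 2, 2, 3).toDF("i")

dups.select(
  $"i",
  // Default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
  // The two rows with i = 2 fall into each other's frame, so both get 1 + 2 + 2 = 5.
  sum($"i").over(Window.orderBy($"i")).as("range_sum"),
  // Explicit ROWS frame: each row sees only itself and the rows physically before it,
  // so the two rows with i = 2 get 3 and 5 respectively.
  sum($"i").over(
    Window.orderBy($"i").rowsBetween(Window.unboundedPreceding, Window.currentRow)
  ).as("rows_sum")
).show()

In your example the ordering values 1 to 10000 are all distinct, so every frame grows by exactly one row and the two variants coincide, which is why running_sum1 and running_sum2 match.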
Without orderBy, the default is instead an unbounded window over the entire partition:

rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
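With this default, an aggregate over a window that has only a partitionBy behaves like a group total broadcast to every row of the partition. A small sketch with hypothetical data:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._ // spark is the active SparkSession

val grouped = Seq(("a", 1), ("a", 2), ("b", 5)).toDF("g", "i")

grouped.select(
  $"g",
  $"i",
  // No orderBy: the frame is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING,
  // so every row sees its whole partition: both g = "a" rows get 3, the g = "b" row gets 5.
  sum($"i").over(Window.partitionBy($"g")).as("partition_total")
).show()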
Further investigation shows that these defaults have been in place ever since window functions were introduced in Spark 1.4.0; from the relevant GitHub branch:
def defaultWindowFrame(
    hasOrderSpecification: Boolean,
    acceptWindowFrame: Boolean): SpecifiedWindowFrame = {
  if (hasOrderSpecification && acceptWindowFrame) {
    // If order spec is defined and the window function supports user specified window frames,
    // the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
    SpecifiedWindowFrame(RangeFrame, UnboundedPreceding, CurrentRow)
  } else {
    // Otherwise, the default frame is
    // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
    SpecifiedWindowFrame(RowFrame, UnboundedPreceding, UnboundedFollowing)
  }
}
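If you want to double-check which frame Spark actually resolved for an unspecified window, the physical plan shows it. A quick check, reusing the df from the question (the exact wording of the plan differs between Spark versions):

// In recent versions the resolved window spec is printed as something like
//   specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())
df.select(sum($"i").over(Window.orderBy($"i")).as("running_sum")).explain()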