What is the real difference between Append mode and Update mode in Spark Streaming?
According to the documentation:
Append mode (default) - This is the default mode, where only the new
rows added to the Result Table since the last trigger will be
outputted to the sink. This is supported for only those queries where
rows added to the Result Table is never going to change. Hence, this
mode guarantees that each row will be output only once (assuming
fault-tolerant sink). For example, queries with only select, where,
map, flatMap, filter, join, etc. will support Append mode.
and
Update mode - (Available since Spark 2.1.1) Only the rows in the
Result Table that were updated since the last trigger will be
outputted to the sink. More information to be added in future
releases.
My confusion with Append mode: it says that "only" the new rows added to the Result Table since the last trigger will be output to the sink. So, for example, say we have three rows
r1, r2, r3
arriving at t1, t2, t3
where t1 < t2 < t3.
Now say that at t4 row r2 gets overwritten. If so, when we run in Append mode we will never see that change in the sink? Isn't that just a lost write?
My confusion with Update mode: it says that "only" the rows in the Result Table that were updated since the last trigger will be output to the sink. Does that mean a row must already exist, and is only output to the sink when that existing row is updated? What happens in Update mode when there are no existing rows and a new row comes in?
Looking closely at the description of Append mode in the latest version of the docs, we see that it says
Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only on the queries where existing rows in the Result Table are not expected to change.
In other words, there should never be any overwrites. In scenarios where you know a row can be updated, use Update mode instead.
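To make the Append-mode case concrete, here is a minimal sketch (not from the original post) of a non-aggregating query run in Append mode. It assumes a local SparkSession and uses the built-in rate source; the name eventsQuery is purely illustrative.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder()
      .appName("append-mode-sketch")
      .master("local[*]")
      .getOrCreate()

    // The rate source emits (timestamp, value) rows. A plain filter/select never
    // changes a row once it is in the Result Table, so Append mode is supported
    // and each row reaches the sink exactly once.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .filter(col("value") % 2 === 0)
      .select(col("timestamp"), col("value"))

    val eventsQuery = events.writeStream
      .outputMode("append")   // rows are appended once and never rewritten
      .format("console")
      .start()

    eventsQuery.awaitTermination()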
As for the second question about Update mode, the full quote in the docs is
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode.
The last sentence here is important. When there are no aggregations (which are what produce the actual updates), it is equivalent to Append mode. So in this mode new rows are added as usual rather than simply being skipped.
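And a sketch of the Update-mode behaviour discussed in the second question: with an aggregation, only the groups whose results changed in the current trigger are written out, and a group appearing for the first time counts as changed, so brand-new rows are emitted too. This reuses the spark session from the previous sketch; counts and countsQuery are illustrative names.

    // Running count per bucket; this is stateful, so rows in the Result Table
    // do change over time. A plain aggregation like this (without a watermark)
    // cannot use Append mode, but Update mode works.
    val counts = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .groupBy((col("value") % 10).as("bucket"))
      .count()

    val countsQuery = counts.writeStream
      .outputMode("update")   // only buckets that are new or whose count changed
      .format("console")
      .start()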
For completeness, here is the third mode that is currently available:
Complete Mode - The entire updated Result Table will be written to the external storage. It is up to the storage connector to decide how to handle writing of the entire table.
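For contrast, a one-line change to the previous aggregation sketch switches it to Complete mode, where every trigger rewrites the whole Result Table (all buckets, changed or not) to the sink. Again just a sketch under the same assumptions:

    val completeQuery = counts.writeStream
      .outputMode("complete")  // the entire Result Table on every trigger
      .format("console")
      .start()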
The documentation contains a list of the different query types with their supported modes, along with some useful notes.