Databricks Type 2 / SLCD Updates via Spark Structured Streaming

Question

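I have implemented Slowly Changing Dimensions in batch on DWHs many times, in such a way that a set of more than one change for a given business key can be processed. No sweat.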

I have done this using the following:

Spark non-Structured Streaming programs
ETL tools
PL/SQL

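However, as far as I can see, this is not possible with Structured Streaming, because of the multi-step nature of the update and the prevailing limitations of Spark Structured Streaming.

Or is it possible? If so, please advise what the approach is.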

Answer 1

Yes, it's possible, but you need some code to implement it. From your dataframe of updates, you need to create a union of:

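The updates themselves, carrying the full merge key. These rows will match, and for the matched target rows you set current = false and end_date = date_of_new_record.
The result of an inner join of the updates with the target table, but with the merge key set to NULL. These rows will not match, so they are inserted as new rows with current = true and end_date = null.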

The code below is from the official documentation (and notebook):

-- These rows will either UPDATE the current addresses of existing 
-- customers or INSERT the new addresses of new customers

SELECT updates.customerId as mergeKey, updates.* FROM updates

UNION ALL
 
-- These rows will INSERT new addresses of existing customers 
-- Setting the mergeKey to NULL forces these rows 
-- to NOT MATCH and be INSERTed.

SELECT NULL as mergeKey, updates.*
FROM updates JOIN customers
ON updates.customerid = customers.customerid 
WHERE customers.current = true 
  AND updates.address <> customers.address
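
To see why the NULL merge key produces the extra insert, here is a small plain-Python simulation of the staging query and of a merge applied to its output. It is only an illustrative sketch, not Delta Lake code: the helper names `stage_updates` and `merge_scd2` are hypothetical, the column names (`customerId`, `address`, `current`, `effectiveDate`, `endDate`) are assumed for the example, and it assumes at most one update per customer in the batch.

```python
def stage_updates(updates, customers):
    """Mimic the UNION ALL of the two SELECT branches using lists of dicts."""
    # Branch 1: every update, keyed by its own customerId.
    staged = [{"mergeKey": u["customerId"], **u} for u in updates]
    # Branch 2: updates that change the address of a current row,
    # keyed by None (standing in for SQL NULL).
    for u in updates:
        for c in customers:
            if (c["customerId"] == u["customerId"] and c["current"]
                    and c["address"] != u["address"]):
                staged.append({"mergeKey": None, **u})
    return staged


def merge_scd2(customers, staged):
    """Mimic a MERGE on customerId = mergeKey (assumed merge conditions)."""
    target = [dict(c) for c in customers]  # work on a copy of the target
    inserted = []
    for s in staged:
        # NULL never equals anything, so a None mergeKey matches no target row.
        matched = [c for c in target
                   if s["mergeKey"] is not None
                   and c["customerId"] == s["mergeKey"]]
        if not matched:
            # NOT MATCHED: insert the row as the new current version.
            inserted.append({"customerId": s["customerId"],
                             "address": s["address"],
                             "current": True,
                             "effectiveDate": s["effectiveDate"],
                             "endDate": None})
            continue
        for c in matched:
            # MATCHED: close the current row only if the address changed.
            if c["current"] and c["address"] != s["address"]:
                c["current"] = False
                c["endDate"] = s["effectiveDate"]
    return target + inserted


# Example data (made up for illustration).
customers = [{"customerId": 1, "address": "A", "current": True,
              "effectiveDate": "2020-01-01", "endDate": None}]
updates = [{"customerId": 1, "address": "B", "effectiveDate": "2021-06-01"},
           {"customerId": 2, "address": "C", "effectiveDate": "2021-06-01"}]

staged = stage_updates(updates, customers)
result = merge_scd2(customers, staged)
```

In this example the update for customer 1 appears twice in `staged`: once with its real key, which closes the old row, and once with a `None` key, which inserts the new current row. Customer 2 appears once and is simply inserted.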

The dataframe produced by this union is then used in a MERGE statement that is called from .foreachBatch.
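
A minimal PySpark sketch of that wiring might look like the following. It rests on several assumptions: the target is a Delta table named `customers` with columns `customerId`, `address`, `current`, `effectiveDate` and `endDate`, the names `MERGE_SQL`, `upsert_scd2` and `start_scd2_stream` are hypothetical, `DataFrame.sparkSession` is available (PySpark 3.3 or later), and each micro-batch carries at most one change per `customerId` (otherwise the batch would need to be deduplicated or ordered first, since MERGE rejects several source rows modifying the same target row).

```python
# Assumed table and column names; adjust to your own schema.
MERGE_SQL = """
MERGE INTO customers
USING (
  SELECT updates.customerId AS mergeKey, updates.* FROM updates
  UNION ALL
  SELECT NULL AS mergeKey, updates.*
  FROM updates JOIN customers
    ON updates.customerId = customers.customerId
  WHERE customers.current = true
    AND updates.address <> customers.address
) staged_updates
ON customers.customerId = staged_updates.mergeKey
WHEN MATCHED AND customers.current = true
             AND customers.address <> staged_updates.address THEN
  UPDATE SET current = false, endDate = staged_updates.effectiveDate
WHEN NOT MATCHED THEN
  INSERT (customerId, address, current, effectiveDate, endDate)
  VALUES (staged_updates.customerId, staged_updates.address, true,
          staged_updates.effectiveDate, NULL)
"""


def upsert_scd2(micro_batch_df, batch_id):
    """foreachBatch callback: run the SCD Type 2 MERGE for one micro-batch."""
    # Expose the micro-batch under the name the MERGE statement expects.
    micro_batch_df.createOrReplaceTempView("updates")
    # Use the session of the micro-batch dataframe so the temp view is visible.
    micro_batch_df.sparkSession.sql(MERGE_SQL)


def start_scd2_stream(updates_stream_df, checkpoint_path):
    """Attach the callback to a streaming dataframe and start the query."""
    return (updates_stream_df.writeStream
            .foreachBatch(upsert_scd2)
            .option("checkpointLocation", checkpoint_path)
            .start())
```

Because each micro-batch is handed to the callback as an ordinary dataframe, the multi-step logic runs as a batch operation inside the stream.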

apache-spark

databricks

delta-lake