What is a "dimension snapshot" in a functional data engineering approach to deal with slowly changing dimensions?

In his popular post "Functional Data Engineering — a modern paradigm for batch data processing", Maxime Beauchemin suggests handling slowly changing dimensions by taking dimension snapshots, appending a new partition at each ETL schedule:

But how do we model this in a functional data warehouse without mutating data? Simple. With dimension snapshots where a new partition is appended at each ETL schedule. The dimension table becomes a collection of dimension snapshots where each partition contains the full dimension as-of a point in time.

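I tried digging through the comments for an answer but could not find a simple explanation. What does it mean to take a dimension snapshot and append it to a daily partition?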

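A snapshot is a table value: the value of some base/variable table as of some time/version. Here "dimension snapshot" seems to mean "dimension table snapshot". The author seems to be proposing a new regime whose dimension table is like the one in the old regime, but with a time/version column added and the table partitioned on it. Under the old regime, we update the dimension table in place at a certain time, once per version. Under the new regime, at every time/version where we would have updated the old dimension table to some new state, we instead take (a snapshot of) that state, add a column set to that time/version, and add those rows to the new dimension table.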
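The regime described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the author's implementation: the table is a plain list of dicts, and the names `SOURCE_BY_DAY`, `append_snapshot`, `dimension_as_of`, and the `as_of` column are all made up for the example.

```python
from datetime import date

# Hypothetical state of a customer dimension on two consecutive days.
SOURCE_BY_DAY = {
    date(2021, 7, 25): [
        {"customer_id": 1, "city": "Paris"},
        {"customer_id": 2, "city": "Oslo"},
    ],
    date(2021, 7, 26): [
        {"customer_id": 1, "city": "Berlin"},  # customer 1 moved
        {"customer_id": 2, "city": "Oslo"},
    ],
}


def append_snapshot(dim_table, as_of, current_rows):
    """Return a new table with the full dimension as of `as_of` added.

    Rows of other versions are never modified. Rerunning for the same
    `as_of` replaces only that version's rows, so the step is idempotent.
    """
    kept = [row for row in dim_table if row["as_of"] != as_of]
    snapshot = [{**row, "as_of": as_of} for row in current_rows]
    return kept + snapshot


def dimension_as_of(dim_table, as_of):
    """Select the rows belonging to a single snapshot."""
    return [row for row in dim_table if row["as_of"] == as_of]


# One "ETL run" per day: each run appends a complete copy of the dimension.
dim_customer = []
for day in sorted(SOURCE_BY_DAY):
    dim_customer = append_snapshot(dim_customer, day, SOURCE_BY_DAY[day])
```

Calling `dimension_as_of(dim_customer, date(2021, 7, 25))` still returns customer 1 in Paris, because the later run added rows instead of overwriting the earlier ones.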

You could also ask the author; the blog post is recent.

This talk by Maxime may help clarify the proposed concept and approach. It includes a practical example: https://youtu.be/4Spo2QRTz1k?t=952

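TLDW/R: the approach is that whenever you build your dimension tables, you store the entire output (instead of, say, updating an existing table). If it is not already in your model, you add some sort of time-based partitioning column (e.g. as_of_date, which storage-wise translates to something like dimensions/customer/as_of_date=20210726/*) so that you can query all of those dimension snapshots.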
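To make the storage layout concrete, here is a rough sketch using only the Python standard library. It assumes CSV files on a local disk purely for illustration (a real warehouse would more likely use Hive-style partitioned Parquet or similar). The helper names `write_snapshot`, `read_snapshot`, `list_snapshots` and the file name `part-0.csv` are hypothetical.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def partition_path(root, dimension, as_of_date):
    """Build a path like <root>/dimensions/customer/as_of_date=20210726."""
    return Path(root) / "dimensions" / dimension / f"as_of_date={as_of_date}"


def write_snapshot(root, dimension, as_of_date, rows):
    """Store the full dimension output in its own as_of_date partition.

    Assumes `rows` is a non-empty list of dicts sharing the same keys.
    """
    partition = partition_path(root, dimension, as_of_date)
    # Overwrite the whole partition so a rerun gives the same result.
    if partition.exists():
        shutil.rmtree(partition)
    partition.mkdir(parents=True)
    with open(partition / "part-0.csv", "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return partition


def read_snapshot(root, dimension, as_of_date):
    """Read every file in one partition, i.e. the dimension as of that date."""
    rows = []
    for part in sorted(partition_path(root, dimension, as_of_date).glob("*.csv")):
        with open(part, newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


def list_snapshots(root, dimension):
    """List the as_of_date values for which a snapshot exists."""
    base = Path(root) / "dimensions" / dimension
    return sorted(p.name.split("=", 1)[1] for p in base.glob("as_of_date=*"))


# Two daily runs, each writing a complete copy of the (tiny) dimension.
root = tempfile.mkdtemp()
write_snapshot(root, "customer", "20210725", [{"customer_id": "1", "city": "Paris"}])
write_snapshot(root, "customer", "20210726", [{"customer_id": "1", "city": "Berlin"}])
```

Because each run only ever writes its own partition, backfilling or rerunning a given day does not touch any other day's snapshot, and a fact table can be joined to the dimension as it looked on the matching as_of_date.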

database

postgresql

etl

functional-programming

airflow