Filtering bounded data in Dataflow based on timestamp

Question

In my Dataflow pipeline I will have two PCollection<TableRow>s read from BigQuery tables. I plan to merge these two PCollections into a single PCollection with a Flatten.

Since BigQuery is append-only, the goal is to write-truncate the second table in BigQuery with the new PCollection.

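I have read through the documentation, but I am confused about the intermediate steps. With my new PCollection, the plan is to use a Comparator DoFn to look at the max last update date and return the given row. I'm unsure whether I should use a Filter transform, or whether I should do a GroupByKey and then filter.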

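All of the PCollection<TableRow>s will contain the same kinds of values, i.e. a string, an integer, and a timestamp. When it comes to key-value pairs, most of the Cloud Dataflow documentation uses only simple strings. Is it possible to have a key-value pair that is the entire row of the PCollection<TableRow>?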

The rows would look similar to:

customerID, customerName, lastUpdateDate
0001, customerOne, 2016-06-01 00:00:00
0001, customerOne, 2016-06-11 00:00:00

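In the example above, I would like to filter the PCollection so that only the second row is returned in the PCollection that gets written to BigQuery. Also, is it possible to apply these ParDos on the third PCollection without creating a fourth?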

Answer 1

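You've asked a few questions. I have tried to answer them in isolation, but I may have misunderstood the whole scenario. If you provided some example code, it might help to clarify.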

With my new PCollection the plan is to use a Comparator DoFn to look at the max last update date and returning the given row. I'm unsure if I should be using a filter transform or if I should be doing a Group by key and then using a filter?

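From your description, it seems that you want to take a PCollection of elements and, for each customerID (the key), find the most recent update to that customer's record. You can use the provided transforms to accomplish this via Top.largestPerKey(1, timestampComparator), where you set up your timestampComparator to look only at the timestamp.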
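
To make the "latest row per customerID" idea concrete, here is a minimal plain-Java sketch of the logic only, not pipeline code. The rows are modeled as Map<String, String> instead of TableRow, and the names LatestPerCustomer, TimestampComparator, latestPerCustomer, and row are made up for illustration. The comparator is declared Serializable because that is what you would need in order to hand it to a Top transform; as far as I know the overload that takes a comparator is Top.perKey(count, comparator), while Top.largestPerKey(count) relies on natural ordering, so check this against your SDK version.

```java
import java.io.Serializable;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LatestPerCustomer {
    // Format of lastUpdateDate in the sample rows, e.g. "2016-06-11 00:00:00".
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Orders rows by lastUpdateDate only, ignoring every other field. */
    public static class TimestampComparator
            implements Comparator<Map<String, String>>, Serializable {
        @Override
        public int compare(Map<String, String> a, Map<String, String> b) {
            LocalDateTime ta = LocalDateTime.parse(a.get("lastUpdateDate"), FORMAT);
            LocalDateTime tb = LocalDateTime.parse(b.get("lastUpdateDate"), FORMAT);
            return ta.compareTo(tb);
        }
    }

    /** Local stand-in for: key each row by customerID, keep the largest row per key. */
    public static Map<String, Map<String, String>> latestPerCustomer(
            List<Map<String, String>> rows) {
        Comparator<Map<String, String>> byTimestamp = new TimestampComparator();
        Map<String, Map<String, String>> latest = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            // Keep whichever of the stored row and the new row has the later timestamp.
            latest.merge(row.get("customerID"), row,
                    (current, candidate) ->
                            byTimestamp.compare(candidate, current) > 0 ? candidate : current);
        }
        return latest;
    }

    /** Hypothetical helper that builds one row with the three sample columns. */
    public static Map<String, String> row(String id, String name, String lastUpdate) {
        Map<String, String> r = new LinkedHashMap<>();
        r.put("customerID", id);
        r.put("customerName", name);
        r.put("lastUpdateDate", lastUpdate);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(row("0001", "customerOne", "2016-06-01 00:00:00"));
        rows.add(row("0001", "customerOne", "2016-06-11 00:00:00"));
        // Prints the surviving row for each customerID.
        latestPerCustomer(rows).values().forEach(System.out::println);
    }
}
```

With the two sample rows from the question, only the 2016-06-11 row survives for customer 0001. In a real pipeline the grouping and the per-key maximum would be done by the transform rather than by a local map.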

Is it possible to have a key value pair that is the entire row of the PCollection?

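The key (K) and value (V) of a KV<K, V> can have any type. If you want to group by key, then the coder for the keys needs to be deterministic. TableRowJsonCoder is not deterministic, because it may contain arbitrary objects. But it sounds like you want the customerID as the key and the entire TableRow as the value.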
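
As a rough illustration of why a non-deterministic key encoding is a problem, here is a small plain-Java sketch. It assumes, as I understand it, that grouping compares keys by their encoded bytes. The encode helper is a hypothetical stand-in and is not TableRowJsonCoder; it uses field order as a simple source of non-determinism, which is a different cause from the arbitrary objects mentioned above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class KeyEncodingSketch {
    /** Hypothetical stand-in for a coder: encodes a value via its string form. */
    public static byte[] encode(Object value) {
        return String.valueOf(value).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two rows with the same fields, inserted in a different order.
        Map<String, String> a = new LinkedHashMap<>();
        a.put("customerID", "0001");
        a.put("customerName", "customerOne");

        Map<String, String> b = new LinkedHashMap<>();
        b.put("customerName", "customerOne");
        b.put("customerID", "0001");

        // The rows are equal as maps, but their encoded forms differ.
        System.out.println("rows equal: " + a.equals(b));
        System.out.println("row bytes equal: " + Arrays.equals(encode(a), encode(b)));

        // A plain String key such as customerID encodes the same way every time.
        System.out.println("key bytes equal: "
                + Arrays.equals(encode(a.get("customerID")), encode(b.get("customerID"))));
    }
}
```

The two rows compare equal as maps but do not encode to the same bytes, while the customerID strings do. That is the reason to key on customerID and carry the whole row as the value.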

is it possible to apply these Pardo's on the third PCollection without creating a fourth?

When you apply a PTransform to a PCollection, it results in a new PCollection. There is no way around that, and you don't need to try to minimize the number of PCollections in your pipeline.

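A PCollection is a conceptual object; it has no intrinsic cost. Your pipeline will be heavily optimized, so many intermediate PCollections, especially those in a sequence of ParDo transforms, will never be materialized anyway.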

java

google-cloud-dataflow