PySpark

Question

我正在尝试了解 pyspark.sql.currenttimestamp() 和 [=28 之间的行为差异=]datetime.now()

如果我使用这两种机制在 DataBricks 中创建一个 Spark 数据帧来创建时间戳列，一切都会按预期正常工作....

curDate2 = spark.range(10)\
  .withColumn("current_date_lit",F.lit(date.today()))\
  .withColumn("current_timestamp_lit",F.lit(F.current_timestamp()))\
  .withColumn("current_timestamp",F.current_timestamp())\
  .withColumn("now",F.lit(datetime.now()))

+---+----------------+---------------------+--------------------+--------------------+
| id|current_date_lit|current_timestamp_lit|   current_timestamp|                 now|
+---+----------------+---------------------+--------------------+--------------------+
|  0|      2022-02-12| 2022-02-12 16:40:...|2022-02-12 16:40:...|2022-02-12 16:40:...|
|  1|      2022-02-12| 2022-02-12 16:40:...|2022-02-12 16:40:...|2022-02-12 16:40:...|
|  2|      2022-02-12| 2022-02-12 16:40:...|2022-02-12 16:40:...|2022-02-12 16:40:...|
+---+----------------+---------------------+--------------------+--------------------+

但是，几分钟后，当我在数据帧上调用 show() 时，基于 currenttimestamp() 的列显示现在的时间（16:44），同时datetime.now() 列显示了第一次创建数据帧的时间戳 (16:40)

很明显，一列包含文字值，另一列在运行时枚举函数，但我无法理解为什么它们的行为不同

show() 几分钟后...

+---+----------------+---------------------+--------------------+--------------------+
| id|current_date_lit|current_timestamp_lit|   current_timestamp|                 now|
+---+----------------+---------------------+--------------------+--------------------+
|  0|      2022-02-12| 2022-02-12 16:44:...|2022-02-12 16:44:...|2022-02-12 16:40:...|
|  1|      2022-02-12| 2022-02-12 16:44:...|2022-02-12 16:44:...|2022-02-12 16:40:...|
|  2|      2022-02-12| 2022-02-12 16:44:...|2022-02-12 16:44:...|2022-02-12 16:40:...|
+---+----------------+---------------------+--------------------+--------------------+

谢谢 - 我希望这是有道理的！

Answer 1

current_timestamp() returns 一个 TimestampType 列，其值是在查询时计算的，如 docs 中所述。所以每次你打电话时 'computed' show。

Returns the current timestamp at the start of query evaluation as a TimestampType column. All calls of current_timestamp within the same query return the same value.

将此列传递给 lit 调用不会改变任何内容，如果您检查 source code 您可以看到 lit 只是 returns 您调用它的列与.

return col if isinstance(col, Column) else _invoke_function("lit", col)

如果你用列以外的东西来计算 lit，例如一个日期时间对象，然后使用此文字值创建一个新列。文字是从 datetime.now() 返回的日期时间对象。这是一个静态值，表示调用 datetime.now 函数的时间。

Answer 2

好问题，我尝试使用 rand() 函数只是为了检查一下。这有点直观，但与此同时，没有先验 .cache() 应用于某些数据的操作会让人相信，新一轮 --> 一组新结果。

show() 是一个有一些智慧的动作。这是基于相同的底层 rdd，从逻辑上讲，人们会期待确定性的结果——至少我是这么认为的。
但是，F.current_timestamp() 在序列化时计算一次。因此，两个连续的 show() 将有 2 个不同的时间。另一个答案指出并指向文档。所以这是一个例外，因此尝试使用 rand()。见下文。
Datetime.now() 由 Spark 保持不变 - 请参阅 WholeStageCodeGen - 它是如何工作的，因为它涉及相同的底层 DF；它假定第一个 lit 函数仍然适用，因为之前创建的 DF（底层 RDD）仍然存在。我检查了 rand() 和所有连续的 show() 操作 return 相同的随机数序列 - 使用相同的种子。这模拟了 确定性 行为，这是我们想要的 2 个连续 show() 的
有了同名的新DF，那显然也是re-evaluated。

你可以尝试看看如果你使用.cache()会发生什么。

当然是人为的例子range(10)。

PySpark - 时间戳行为

PySpark - Timestamp behavior

apache-spark

databricks