Why is unix_timestamp parsing this incorrectly by 12 hours off?

Question

The following seems to be incorrect (run through spark.sql):

select unix_timestamp("2017-07-03T12:03:56", "yyyy-MM-dd'T'hh:mm:ss")
-- 1499040236

compared to:

select unix_timestamp("2017-07-03T00:18:31", "yyyy-MM-dd'T'hh:mm:ss")
-- 1499041111

Clearly the first time comes after the second, yet its timestamp is smaller. The second result appears to be correct:

# ** R Code **
# establish constants
one_day = 60 * 60 * 24
one_year = 365 * one_day
one_year_leap = 366 * one_day
one_quad = 3 * one_year + one_year_leap

# to 2014-01-01
11 * one_quad +
  # to 2017-01-01
  2 * one_year + one_year_leap + 
  # to 2017-07-01
  (31 + 28 + 31 + 30 + 31 + 30) * one_day + 
  # to 2017-07-03 00:18:31
  2 * one_day + 18 * 60 + 31
# [1] 1499041111

A similar calculation shows that the first should be 1499083436 (confirmed by as.integer(as.POSIXct('2017-07-03 12:03:56', tz = 'UTC')) in R), and that 1499040236 should correspond to 2017-07-03 00:03:56.
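For a cross-check outside R, here is a minimal Java sketch that converts the three local date-times to epoch seconds, assuming they are meant as UTC. The class and method names (EpochCheck, utcEpochSeconds) are made up for illustration.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

class EpochCheck {
    // Interprets an ISO-8601 local date-time as UTC and returns
    // the number of seconds since 1970-01-01T00:00:00Z.
    static long utcEpochSeconds(String isoLocalDateTime) {
        return LocalDateTime.parse(isoLocalDateTime).toEpochSecond(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        System.out.println(utcEpochSeconds("2017-07-03T12:03:56")); // 1499083436
        System.out.println(utcEpochSeconds("2017-07-03T00:18:31")); // 1499041111
        System.out.println(utcEpochSeconds("2017-07-03T00:03:56")); // 1499040236
    }
}
```

The first and third values differ by exactly 43200 seconds, which is the 12-hour gap described above.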

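So what is happening here? It certainly looks like a bug. Two last sanity checks: select unix_timestamp("2017-07-03T00:03:56", "yyyy-MM-dd'T'hh:mm:ss") correctly returns 1499040236, and replacing the T in the middle with a space has no effect on the incorrect parse.

Since it appears to have been fixed in development, I'll note that this is on 2.1.1.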

Answer 1

It is just a wrong format:

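Your data is in the 0-23 hour format (denoted in SimpleDateFormat as HH).
You use the hh format, which corresponds to the 1-12 hour (am/pm) format. With no am/pm marker in the pattern, the lenient parser reads 12 as 12 AM, i.e. midnight, which is why the result is off by 12 hours.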
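The difference can be reproduced with plain SimpleDateFormat, without Spark. This is a minimal sketch assuming UTC and the JDK's default lenient parsing. HourPatternDemo and parseUtc are hypothetical names, and the sketch makes no claim about how Spark wires up its parser internally.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

class HourPatternDemo {
    // Parses text with the given pattern in UTC using lenient parsing (the JDK default)
    // and returns epoch seconds, or null when the text cannot be parsed.
    static Long parseUtc(String text, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return format.parse(text).getTime() / 1000L;
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String input = "2017-07-03T12:03:56";
        // hh: 12 is taken as 12 AM, so the result is midnight plus 3:56.
        System.out.println(parseUtc(input, "yyyy-MM-dd'T'hh:mm:ss")); // 1499040236
        // HH: 12 is taken as hour 12 of the day, the intended reading.
        System.out.println(parseUtc(input, "yyyy-MM-dd'T'HH:mm:ss")); // 1499083436
    }
}
```

Switching the pattern in the query to "yyyy-MM-dd'T'HH:mm:ss" is the corresponding fix on the Spark side.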

In fact, in the latest Spark version (2.3.0 RC1) it would not parse at all:

spark.version

String = 2.3.0

spark.sql("""
  select unix_timestamp("2017-07-03T00:18:31", "yyyy-MM-dd'T'hh:mm:ss")""").show

+----------------------------------------------------------+
|unix_timestamp(2017-07-03T00:18:31, yyyy-MM-dd'T'hh:mm:ss)|
+----------------------------------------------------------+
|                                                      null|
+----------------------------------------------------------+
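One plausible explanation for the null, stated here as an assumption rather than something checked against Spark's source, is strict (non-lenient) parsing: in that mode SimpleDateFormat rejects hour 00 for hh because it lies outside 1-12. A minimal sketch with hypothetical names (StrictHourDemo, parseStrictUtc):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

class StrictHourDemo {
    // Parses text in UTC with lenient parsing turned off and returns epoch seconds,
    // or null when the text does not match the pattern (mirroring the null shown above).
    static Long parseStrictUtc(String text, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        format.setLenient(false); // assumption: strict mode is what produces the null
        try {
            return format.parse(text).getTime() / 1000L;
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String input = "2017-07-03T00:18:31";
        System.out.println(parseStrictUtc(input, "yyyy-MM-dd'T'hh:mm:ss")); // null
        System.out.println(parseStrictUtc(input, "yyyy-MM-dd'T'HH:mm:ss")); // 1499041111
    }
}
```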

unix-timestamp

apache-spark

apache-spark-sql