PySpark: convert str to TimestampType

Hello, my faithful coders,

First, you should know that I have already tried many of the solutions you are likely to find on the first page of your favorite search engine. They all end in this error:

TypeError: field dt: TimestampType can not accept object '2021-05-01T09:19:46' in type <class 'str'>

My data is stored as raw.csv in an Amazon S3 bucket and looks like this:

2021-05-01T09:19:46,...
2021-05-01T09:19:42,...
2021-05-01T09:19:39,...

I have tried:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp, unix_timestamp
from pyspark.sql.types import *
from awsglue.context import GlueContext
from pyspark.context import SparkContext

session = SparkSession.builder.getOrCreate()

# read raw.csv from S3 via a Glue DynamicFrame and convert it to a DataFrame
df = GlueContext(SparkContext.getOrCreate()).create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={ 'paths': ["s3://bucket/to/raw.csv"] },
        format="csv",
        format_options={'withHeader': True}
    ).toDF()
# dt is declared as TimestampType even though the CSV values are plain strings
events_schema = StructType([
    StructField("dt", TimestampType(), nullable=False),
    # and many other columns
])
df = session.createDataFrame(df.rdd, schema=events_schema)
df.withColumn("dt", to_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss"))\
    .show(1, False)

df.withColumn("dt", unix_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss")\
        .cast("double")
        .cast("timestamp"))\
    .show(1, False)

I still get exactly the same error.

Try reading dt as StringType first, then converting it to TimestampType with df.withColumn. The TypeError is raised by createDataFrame itself: when you pass an explicit schema, PySpark verifies every value against the declared field type, and a Python str can never satisfy TimestampType, so neither your to_timestamp nor your unix_timestamp call is ever reached.
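To see that the failure happens inside createDataFrame and not in the conversion code, here is a minimal standalone sketch (run outside Glue; the session variable name here is an assumption for the demo):

from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType

spark = SparkSession.builder.getOrCreate()  # assumed local session for the demo
schema = StructType([StructField("dt", TimestampType(), nullable=False)])

# a Python datetime satisfies TimestampType, so this succeeds
spark.createDataFrame([(datetime(2021, 5, 1, 9, 19, 46),)], schema=schema).show()

# a plain str does not, so this raises the TypeError from the question
# during schema verification, before any to_timestamp call can run
spark.createDataFrame([("2021-05-01T09:19:46",)], schema=schema)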

The fix, end to end:

events_schema = StructType([
    StructField("dt", StringType(), nullable=False),  # read dt as a string first
    # and many other columns
])

# reusing the Glue-sourced df and session from the question
df = session.createDataFrame(df.rdd, schema=events_schema)
df.show(10, False)
#+-------------------+
#|dt                 |
#+-------------------+
#|2021-05-01T09:19:46|
#+-------------------+

df.withColumn("dt", to_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss")).show()
#+-------------------+
#|                 dt|
#+-------------------+
#|2021-05-01 09:19:46|
#+-------------------+
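If the createDataFrame round trip exists only to fix the type of dt, you can also skip it and cast the string column directly on the DataFrame you got from the DynamicFrame. A minimal sketch, assuming the same df and column name as above (Spark casts ISO-8601 strings like these to timestamp natively):

from pyspark.sql.functions import col

# withColumn returns a new DataFrame, so assign the result back
df = df.withColumn("dt", col("dt").cast("timestamp"))
df.printSchema()
# root
#  |-- dt: timestamp (nullable = true)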