PySpark Error: cannot resolve '`timestamp`'
I need to find the exact time at which most check-ins occur in the Yelp dataset, but for some reason I keep running into this error. Here is my code so far:
from pyspark.sql.functions import udf
from pyspark.sql.functions import explode
from pyspark.sql.types import IntegerType
from pyspark.sql.types import ArrayType,StringType
from pyspark.sql import functions as F
square_udf_int = udf(lambda z: square(z), IntegerType())
checkin = spark.read.json('yelp_academic_dataset_checkin.json.gz')
datesplit = udf(lambda x: x.split(','),ArrayType(StringType()))
checkin.select('business_id',datesplit('date').alias('dates')).withColumn('checkin_date',explode('dates'))
datesplit = udf(lambda x: x.split(','),ArrayType(StringType()))
dates = checkin.select('business_id',datesplit('date').alias('dates')).withColumn('checkin_date',explode('dates'))
dates = dates.select("checkin_date")
dates.withColumn("checkin_date", F.date_trunc('checkin_date',
F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'"))).show(truncate=0)
Error:
Py4JJavaError: An error occurred while calling o1112.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '`timestamp`' given input columns: [checkin_date];;
'Project [date_trunc(checkin_date, to_timestamp('timestamp, Some(yyyy-MM-dd HH:mm:ss 'UTC')), Some(Etc/UTC)) AS checkin_date#190]
+- Project [checkin_date#176]
+- Project [business_id#6, dates#172, checkin_date#176]
+- Generate explode(dates#172), false, [checkin_date#176]
+- Project [business_id#6, <lambda>(date#7) AS dates#172]
+- Relation[business_id#6,date#7] json
dates is just a Spark DataFrame with a single column named "checkin_date" that contains only datetimes, so I'm not sure why this doesn't work.
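The error simply means that the line of code below tries to access a column named timestamp, but no such column exists. As the message says, the only input column is checkin_date.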
dates.withColumn("checkin_date", F.date_trunc('checkin_date',
F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'")))
Indeed, here is the signature of the to_timestamp function:
pyspark.sql.functions.to_timestamp(col, format=None)
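The first argument is the column and the second is the format. I assume you are trying to parse the date and then truncate it. Note also that the first argument of date_trunc is the truncation unit (such as 'month' or 'hour'), not a column name. Suppose you want to truncate the date to the month level. The correct way to do it is: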
dates.withColumn("checkin_date", F.date_trunc('month',
F.to_timestamp('checkin_date', "yyyy-MM-dd HH:mm:ss 'UTC'")))