Using the unix_timestamp method to create a timestamp in Spark
I have a CSV file with many columns, two of which are Month and Year. Month is represented as 1...12 and Year as 2013... (for example). I need to create a timestamp in the format mm/yyyy as a new column, e.g. 'timestamp'. I tried the snippet below, but it fails.
scala> val df = spark.read.format("csv").option("header", "true").load("/user/bala/*.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string, Month: string ... 28 more fields]

scala> val df = spark.read.format("csv").option("header", "true").load("/user/bala/AWI/*.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string, Month: string ... 28 more fields]

scala> import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.udf

scala> def makeDT(Month: String, Year: String) = s"$Month $Year"
makeDT: (Month: String, Year: String)String

scala> val makeDt = udf(makeDT(_: String, _: String))
makeDt: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, StringType)))

scala> df.select($"Month", $"Year", unix_timestamp(makeDt($"Month", $"Year"), "mm/yyyy")).show(2)
+-----+----+-----------------------------------------+
|Month|Year|unix_timestamp(UDF(Month, Year), mm/yyyy)|
+-----+----+-----------------------------------------+
|    1|2013|                                     null|
|    1|2013|                                     null|
+-----+----+-----------------------------------------+
only showing top 2 rows

scala>
Can anyone point out where I went wrong?
You need a day as well as a month and a year to build a timestamp. (Also note that in the date pattern, mm means minutes, not month, which is MM, and your UDF emits "1 2013" while the pattern expects "1/2013", so unix_timestamp returns null.) You can redefine your makeDT:
scala> def makeDT(Month: String, Year: String) = s"01/$Month/$Year 00:00:00"
Rebuild the makeDt UDF from the new definition, and you can then use it as below (I have not tested this):
unix_timestamp(makeDt($"Month", $"Year"), "dd/M/yyyy HH:mm:ss").cast("timestamp")
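Putting it together, here is a minimal end-to-end sketch of that approach (untested against your data; the Month and Year column names come from your question, and the output column name "timestamp" is just an illustration):

import org.apache.spark.sql.functions.{udf, unix_timestamp}
import spark.implicits._  // for the $"..." column syntax; already available in spark-shell

// Build a complete date string; unix_timestamp needs a day, month, and year to parse.
def makeDT(month: String, year: String) = s"01/$month/$year 00:00:00"
val makeDt = udf(makeDT(_: String, _: String))

// "M" accepts a 1- or 2-digit month, so both "1" and "12" parse.
// unix_timestamp yields seconds since the epoch, which cast("timestamp") expects.
val withTs = df.withColumn(
  "timestamp",
  unix_timestamp(makeDt($"Month", $"Year"), "dd/M/yyyy HH:mm:ss").cast("timestamp"))

withTs.select($"Month", $"Year", $"timestamp").show(2)

A UDF is not strictly required here; concat_ws or format_string from org.apache.spark.sql.functions could build the same date string directly on the columns.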