Spark SQL - 如何按小时查找交易总数

Question

例如，如果我有一个 table 和 transaction number 和 transaction date [作为 timestamp] 列，我如何找出交易总数hourly basis?

是否有任何 Spark sql 函数可用于这种范围计算？

Answer 1

这里我尝试给出一些方法的指针，比较完整的代码，请看这里

时间间隔文字：使用间隔文字，可以从日期或时间戳值中减去或添加任意时间量。当您想要从固定时间点添加或减去时间段时，此表示法很有用。例如，用户现在可以轻松地表达查询，例如 “查找过去一小时内发生的所有交易”。使用以下语法构造区间文字： [sql]INTERVAL值单位[/sql]

下面是python中的方法。您可以修改以下示例以符合您的要求，即相应的交易日期开始时间、结束时间。在你的情况下而不是 id 它的交易号码。

# Import functions.
from pyspark.sql.functions import *
# Create a simple DataFrame.
data = [
  ("2015-01-01 23:59:59", "2015-01-02 00:01:02", 1),
  ("2015-01-02 23:00:00", "2015-01-02 23:59:59", 2),
  ("2015-01-02 22:59:58", "2015-01-02 23:59:59", 3)]
df = sqlContext.createDataFrame(data, ["start_time", "end_time", "id"])
df = df.select(
  df.start_time.cast("timestamp").alias("start_time"),
  df.end_time.cast("timestamp").alias("end_time"),
  df.id)
# Get all records that have a start_time and end_time in the
# same day, and the difference between the end_time and start_time
# is less or equal to 1 hour.
condition = \
  (to_date(df.start_time) == to_date(df.end_time)) & \
  (df.start_time + expr("INTERVAL 1 HOUR") >= df.end_time)
df.filter(condition).show()
+———————+———————+—+
|start_time           |            end_time |id |
+———————+———————+—+
|2015-01-02 23:00:00.0|2015-01-02 23:59:59.0|2  |
+———————+———————+—+

使用此方法，您可以应用分组功能来查找您案例中的交易总数。

以上是python代码，scala呢？

expr function 上面使用的在 scala 中也可用

也看看下面描述了..

import org.apache.spark.sql.functions._
    val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
    val df2 = df1
      .withColumn( "diff_secs", diff_secs_col )
      .withColumn( "diff_mins", diff_secs_col / 60D )
      .withColumn( "diff_hrs",  diff_secs_col / 3600D )
      .withColumn( "diff_days", diff_secs_col / (24D * 3600D) )

Answer 2

您可以使用from_unixtime功能。

val sqlContext = new SQLContext(sc)

import org.apache.spark.sql.functions._
import sqlContext.implicits._

val df = // your dataframe, assuming transaction_date is timestamp in seconds
df.select('transaction_number, hour(from_unixtime('transaction_date)) as 'hour)
      .groupBy('hour)
      .agg(count('transaction_number) as 'transactions)

结果：

+----+------------+
|hour|transactions|
+----+------------+
|  10|        1000|
|  12|        2000|
|  13|        3000|
|  14|        4000|
|  ..|        ....|
+----+------------+

Spark SQL - 如何按小时查找交易总数

Spark SQL - How to find total number of transactions on an hourly basis

apache-spark-sql

spark-dataframe