在 pyspark 中使用 window 函数 sql

Question

我有类似下面示例数据的数据。我正在尝试使用 PySpark 在我的数据中创建一个新列，这将是基于时间戳的客户的第一个事件的类别。就像下面的示例输出数据一样。

我在下面有一个例子，我认为使用 sql 中的 window 函数可以完成它。

我对 PySpark 还很陌生。我知道您可以在 PySpark 中运行 sql。我想知道下面的代码是否正确到运行 PySpark 中的 sql window 函数。那就是我想知道我是否可以像下面那样将 sql 代码粘贴到 spark.sql 中。

输入：

eventid customerid category timestamp
1       3          a        1/1/12
2       3          b        2/3/14
4       2          c        4/1/12

输出：

eventid customerid category timestamp first_event
1       3          a        1/1/12    a
2       3          b        2/3/14    a
4       2          c        4/1/12    c

window函数示例：

select eventid, customerid, category, timestamp 
FIRST_VALUE(catgegory) over(partition by customerid order by timestamp) first_event
from table



# implementing window function example with pyspark

PySpark:
# Note: assume df is dataframe with structure of table above
# (df is table)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(“Operations”).getOrCreate()

# Register the DataFrame as a SQL temporary view

df.createOrReplaceView(“Table”)

sql_results = spark.sql(“select eventid, customerid, category, timestamp 
                FIRST_VALUE(catgegory) over(partition by customerid order by                timestamp) first_event
                from table”)

# display results
sql_results.show()

Answer 1

您也可以在 pyspark 中使用 window 函数

>>> import pyspark.sql.functions as F
>>> from pyspark.sql.window import Window
>>> 
>>> df.show()
+-------+----------+--------+---------+
|eventid|customerid|category|timestamp|
+-------+----------+--------+---------+
|      1|         3|       a|   1/1/12|
|      2|         3|       b|   2/3/14|
|      4|         2|       c|   4/1/12|
+-------+----------+--------+---------+

>>> window = Window.partitionBy('customerid')
>>> df = df.withColumn('first_event', F.first('category').over(window))
>>> 
>>> df.show()
+-------+----------+--------+---------+-----------+                             
|eventid|customerid|category|timestamp|first_event|
+-------+----------+--------+---------+-----------+
|      1|         3|       a|   1/1/12|          a|
|      2|         3|       b|   2/3/14|          a|
|      4|         2|       c|   4/1/12|          c|
+-------+----------+--------+---------+-----------+

在 pyspark 中使用 window 函数 sql

using window function sql inside pyspark

python-3.x

apache-spark

pyspark

pyspark-sql