How to join a table with 'valid_from' and 'valid_to' columns to a table with a timestamp?

Question

I am working in PySpark and have a table that contains sales data for specific articles, with one row per date and article:

#ARTICLES
+-----------+----------+
|timestamp  |article_id|
+-----------+----------+
| 2018-01-02|   1111111|
| 2018-01-02|   2222222|
| 2018-01-02|   3333333|
| 2018-01-03|   1111111|
| 2018-01-03|   2222222|
| 2018-01-03|   3333333|
+-----------+----------+

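I also have a smaller table that contains the price data for each article. Each price is valid from one date to another, as specified in the last two columns: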

#PRICES
+----------+-----+----------+----------+
|article_id|price|from_date |to_date   |
+----------+-----+----------+----------+
|   1111111| 8.99|2000-01-01|2999-12-31|
|   2222222| 4.29|2000-01-01|2006-09-05|
|   2222222| 2.29|2006-09-06|2999-12-31|
+----------+-----+----------+----------+

The last two rows show that the price of article 2222222 was reduced on 2006-09-06.

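I now want to join the price onto the first table. It must be the price that was valid at the respective timestamp. For this example, I want the following result: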

#RESULT
+-----------+----------+-----+
|timestamp  |article_id|price|
+-----------+----------+-----+
| 2018-01-02|   1111111| 8.99|
| 2018-01-02|   2222222| 2.29|
| 2018-01-02|   3333333| null|
| 2018-01-03|   1111111| 8.99|
| 2018-01-03|   2222222| 2.29|
| 2018-01-03|   3333333| null|
+-----------+----------+-----+

What is the best way to do this?

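One idea I had was to "roll out" the price table so that it contains one row per timestamp and article_id, and then join on those two keys. But I don't know how to roll out a table using two date columns.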
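
For illustration only, the roll-out idea can be sketched in plain Python rather than Spark. The helper `roll_out` and the clipping window are hypothetical names introduced here; the clipping is an assumption added because a `to_date` of 2999-12-31 would otherwise produce hundreds of thousands of rows per price. In Spark 2.4 or later, `sequence` and `explode` from `pyspark.sql.functions` could probably play the same role, but that is not shown or tested here.

```python
from datetime import date, timedelta

def roll_out(price_rows, clip_start, clip_end):
    """Yield one (day, article_id, price) tuple per valid day.

    The validity range is clipped to [clip_start, clip_end] (hypothetical
    parameters) so that open-ended ranges stay small.
    """
    for article_id, price, valid_from, valid_to in price_rows:
        day = max(valid_from, clip_start)
        last = min(valid_to, clip_end)
        while day <= last:
            yield (day, article_id, price)
            day += timedelta(days=1)

# The PRICES rows from the question as plain tuples.
prices = [
    (1111111, 8.99, date(2000, 1, 1), date(2999, 12, 31)),
    (2222222, 4.29, date(2000, 1, 1), date(2006, 9, 5)),
    (2222222, 2.29, date(2006, 9, 6), date(2999, 12, 31)),
]

# Only the dates that occur in ARTICLES are needed.
daily = list(roll_out(prices, date(2018, 1, 2), date(2018, 1, 3)))

# After the roll-out, the lookup is a plain equality match on two keys.
lookup = {(day, article_id): price for day, article_id, price in daily}
print(lookup.get((date(2018, 1, 2), 2222222)))  # 2.29
print(lookup.get((date(2018, 1, 3), 3333333)))  # None
```

The rolled-out table grows with the number of days covered, so a direct join on the date range is likely to be cheaper whenever the validity ranges are long.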

Answer 1

A join with a between condition should work.

from pyspark.sql.functions import col
articles.alias('articles').join(prices.alias('prices'), 
   on=(
        (col('articles.article_id') == col('prices.article_id')) & 
        (col('articles.timestamp').between(col('prices.from_date'), col('prices.to_date')))
   ),
   how='left'
).select('articles.*','prices.price')
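
`Column.between` is inclusive on both ends, which matters here because the validity ranges in PRICES touch without overlapping (2006-09-05 and 2006-09-06). The following is a minimal plain-Python sketch of that comparison, not Spark code; the function names are hypothetical and only model the two possible readings of the range.

```python
from datetime import date

# Plain-Python stand-in for the PRICES rows of article 2222222.
prices = [
    (4.29, date(2000, 1, 1), date(2006, 9, 5)),
    (2.29, date(2006, 9, 6), date(2999, 12, 31)),
]

def price_inclusive(day):
    """Mimic between(): from_date <= day <= to_date."""
    return [p for p, lo, hi in prices if lo <= day <= hi]

def price_exclusive(day):
    """Strict comparison: from_date < day < to_date."""
    return [p for p, lo, hi in prices if lo < day < hi]

boundary = date(2006, 9, 6)  # first day of the reduced price
print(price_inclusive(boundary))  # [2.29]
print(price_exclusive(boundary))  # []
```

With inclusive bounds every day maps to exactly one price row; with strict bounds the first and last day of each range would get no price at all.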

Answer 2

Another option is to do a left join on article_id and then filter the result with DataFrame.where() to pick the price.

import pyspark.sql.functions as f
articles.alias("a").join(prices.alias("p"), on="article_id", how="left")\
    .where(
        f.col("p.article_id").isNull() |  # without this, it becomes an inner join
        f.col("timestamp").between(
            f.col("from_date"),
            f.col("to_date")
        )
    )\
    .select(
        "timestamp",
        "article_id",
        "price"
    )\
    .show()
#+----------+----------+-----+
#| timestamp|article_id|price|
#+----------+----------+-----+
#|2018-01-02|   1111111| 8.99|
#|2018-01-02|   2222222| 2.29|
#|2018-01-02|   3333333| null|
#|2018-01-03|   1111111| 8.99|
#|2018-01-03|   2222222| 2.29|
#|2018-01-03|   3333333| null|
#+----------+----------+-----+
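
One caveat that may be worth checking with this approach: an article that has price rows, but none valid at the timestamp, fails both branches of the filter and disappears from the result instead of showing a null price. The sketch below models this in plain Python, not Spark; article 4444444 and its 2019 price are made-up data, and both functions are hypothetical stand-ins for the two join styles.

```python
from datetime import date

# Made-up data: article 4444444 has a price, but only for 2019.
articles = [(date(2018, 1, 2), 3333333), (date(2018, 1, 2), 4444444)]
prices = [(4444444, 1.99, date(2019, 1, 1), date(2019, 12, 31))]

def left_join_then_filter(articles, prices):
    """Left join on article_id, then keep unmatched rows or rows in range."""
    out = []
    for ts, art in articles:
        matches = [p for p in prices if p[0] == art]
        if not matches:
            out.append((ts, art, None))  # kept by the isNull() branch
            continue
        for _, price, lo, hi in matches:
            if lo <= ts <= hi:  # the between() branch
                out.append((ts, art, price))
    return out

def left_join_on_range(articles, prices):
    """Left join with the date range inside the join condition."""
    out = []
    for ts, art in articles:
        valid = [p[1] for p in prices if p[0] == art and p[2] <= ts <= p[3]]
        for price in valid or [None]:
            out.append((ts, art, price))
    return out

print(len(left_join_then_filter(articles, prices)))  # 1
print(len(left_join_on_range(articles, prices)))     # 2
```

For the sample data in the question the two styles agree, because every article either has a price valid on both dates or has no price rows at all.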

Answer 3

Here is another way to achieve the result you want:

from pyspark.sql import functions as f
# Inclusive bounds (>= and <=) so that a timestamp equal to from_date or to_date still matches
result = articles.alias('articles').join(
    prices.alias('prices'),
    (f.col('articles.article_id') == f.col('prices.article_id'))
    & (f.col('articles.timestamp') >= f.col('prices.from_date'))
    & (f.col('articles.timestamp') <= f.col('prices.to_date')),
    'left'
).select('articles.*', 'prices.price')

result should be

+----------+----------+-----+
|timestamp |article_id|price|
+----------+----------+-----+
|2018-01-02|2222222   |2.29 |
|2018-01-03|2222222   |2.29 |
|2018-01-02|3333333   |null |
|2018-01-03|3333333   |null |
|2018-01-02|1111111   |8.99 |
|2018-01-03|1111111   |8.99 |
+----------+----------+-----+

apache-spark

apache-spark-sql

pyspark

pyspark-sql