Spark dataframe add Missing Values

Question

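I have a data frame in the following format. For each customer, I want to add empty rows for the missing timestamps (TimeSlot values).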

+-------------+----------+------+----+----+
| Customer_ID | TimeSlot |  A1  | A2 | An |
+-------------+----------+------+----+----+
| c1          |        1 | 10.0 |  2 |  3 |
| c1          |        2 | 11   |  2 |  4 |
| c1          |        4 | 12   |  3 |  5 |
| c2          |        2 | 13   |  2 |  7 |
| c2          |        3 | 11   |  2 |  2 |
+-------------+----------+------+----+----+

The resulting table should have the following format:

+-------------+----------+------+------+------+
| Customer_ID | TimeSlot |  A1  |  A2  |  An  |
+-------------+----------+------+------+------+
| c1          |        1 | 10.0 | 2    | 3    |
| c1          |        2 | 11   | 2    | 4    |
| c1          |        3 | null | null | null |
| c1          |        4 | 12   | 3    | 5    |
| c2          |        1 | null | null | null |
| c2          |        2 | 13   | 2    | 7    |
| c2          |        3 | 11   | 2    | 2    |
| c2          |        4 | null | null | null |
+-------------+----------+------+------+------+

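I have 1 million customers and 360 time slots (the example above shows only 4). The approach I came up with is to create a data frame with 2 columns (Customer_ID, TimeSlot) and 1M x 360 rows, and then left outer join it with the original data frame.

Is there a better way?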

Answer 1

You can express this as a SQL query:

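-- Customer_ID and TimeSlot come from the generated grid (c, t);
-- the value columns come from df and are null where no row matches.
select c.Customer_ID, t.TimeSlot,
       df.A1, df.A2, df.An
from (select distinct Customer_ID from df) c cross join
     (select distinct TimeSlot from df) t left join
     df
     on df.Customer_ID = c.Customer_ID and df.TimeSlot = t.TimeSlot;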

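Notes:

You should put the result into another data frame.
You may already have tables of the available customers and/or time slots. Use those instead of the subqueries.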
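
As a quick way to check the join logic on the sample data from the question, here is a minimal sketch that runs the same kind of query through Python's built-in sqlite3 module. SQLite is only a stand-in for Spark SQL here (an assumption made so the example runs without a cluster), and the column names are taken from the question's table.

```python
import sqlite3

# In-memory SQLite stands in for Spark SQL; only the join logic is checked here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE df (Customer_ID TEXT, TimeSlot INTEGER, A1 REAL, A2 INTEGER, An INTEGER)"
)
conn.executemany(
    "INSERT INTO df VALUES (?, ?, ?, ?, ?)",
    [
        ("c1", 1, 10.0, 2, 3),
        ("c1", 2, 11.0, 2, 4),
        ("c1", 4, 12.0, 3, 5),
        ("c2", 2, 13.0, 2, 7),
        ("c2", 3, 11.0, 2, 2),
    ],
)

# Every customer is paired with every time slot seen in the data,
# then the original rows are attached where they exist.
query = """
SELECT c.Customer_ID, t.TimeSlot, df.A1, df.A2, df.An
FROM (SELECT DISTINCT Customer_ID FROM df) c
CROSS JOIN (SELECT DISTINCT TimeSlot FROM df) t
LEFT JOIN df
  ON df.Customer_ID = c.Customer_ID AND df.TimeSlot = t.TimeSlot
ORDER BY c.Customer_ID, t.TimeSlot
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that a time slot only appears in the output if at least one customer has a row for it, because the slot list is derived from df itself.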

Answer 2

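I think you can use gordon linoff's answer, but since you said you have millions of customers and you are joining on them, you could add the following.

Use a tally table for the time slots, as it may give better performance. For more on how it can be used, see the following link: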

http://www.sqlservercentral.com/articles/T-SQL/62867/

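I also think you should use partitioning or a row number function to split the customerid column by some partition value and select the customers that way. For example, select only the row number values and then cross join them with the tally table. That may improve your performance.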
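
The tally table idea can be sketched as follows, again with Python's built-in sqlite3 as a stand-in for Spark SQL (an assumption so the example runs anywhere). The table name tally and the constant N_SLOTS are hypothetical names chosen for illustration; N_SLOTS is 4 to match the sample, whereas the question mentions 360. In Spark itself, something like spark.range(1, 361) could play the role of the tally table.

```python
import sqlite3

N_SLOTS = 4  # hypothetical constant: 4 matches the sample, the question has 360

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE df (Customer_ID TEXT, TimeSlot INTEGER, A1 REAL, A2 INTEGER, An INTEGER)"
)
db.executemany(
    "INSERT INTO df VALUES (?, ?, ?, ?, ?)",
    [
        ("c1", 1, 10.0, 2, 3),
        ("c1", 2, 11.0, 2, 4),
        ("c1", 4, 12.0, 3, 5),
        ("c2", 2, 13.0, 2, 7),
        ("c2", 3, 11.0, 2, 2),
    ],
)

# Tally table: one row per slot number 1..N_SLOTS, built once
# instead of being derived from df with SELECT DISTINCT.
db.execute("CREATE TABLE tally (TimeSlot INTEGER PRIMARY KEY)")
db.executemany("INSERT INTO tally VALUES (?)", [(n,) for n in range(1, N_SLOTS + 1)])

filled = db.execute(
    """
    SELECT c.Customer_ID, t.TimeSlot, df.A1, df.A2, df.An
    FROM (SELECT DISTINCT Customer_ID FROM df) c
    CROSS JOIN tally t
    LEFT JOIN df
      ON df.Customer_ID = c.Customer_ID AND df.TimeSlot = t.TimeSlot
    ORDER BY c.Customer_ID, t.TimeSlot
    """
).fetchall()
for row in filled:
    print(row)
```

Unlike a slot list derived from the data, the tally table also yields slots that no customer has any row for, since it always covers 1 through N_SLOTS.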
