Hive/pyspark: pivot non numeric data for huge dataset

Question

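I am looking for a way to pivot an input dataset with the structure below in Hive or PySpark. The input contains more than 5 billion records, and each emp_id can have up to 8 rows with 5 columns each, so I will end up with 40 columns. I did refer to an existing answer, but there the pivoted output column is already present in the dataset, whereas in mine it is not. I also tried another link, but the SQL becomes very large (not that it matters). Is there a way to do this where the names of the resulting pivoted columns are concatenated with the rank?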

Input

emp_id,  dept_id,   dept_name, rank
1001,   101,        sales,      1
1001,   102,        marketing,  2
1002,   101,        sales,      1
1002,   102,        marketing,  2

Expected output

emp_id,     dept_id_1, dept_name_1, dept_id_2, dept_name_2
1001,       101,        sales,      102,        marketing
1002,       101,        sales,      102,        marketing

Answer 1

You can aggregate after pivoting, and optionally rename the columns, like this:

import pyspark.sql.functions as F

(df
    .groupBy('emp_id')
    .pivot('rank')
    .agg(
        F.first('dept_id').alias('dept_id'),
        F.first('dept_name').alias('dept_name')
    )
    .show()
)

# Output
# +------+---------+-----------+---------+-----------+
# |emp_id|1_dept_id|1_dept_name|2_dept_id|2_dept_name|
# +------+---------+-----------+---------+-----------+
# |  1002|      101|      sales|      102|  marketing|
# |  1001|      101|      sales|      102|  marketing|
# +------+---------+-----------+---------+-----------+
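
The pivoted columns above come out as 1_dept_id, 1_dept_name, and so on, while the question asks for dept_id_1, dept_name_1. Below is a minimal sketch of one way to close that gap, not a tested solution for the full dataset. The helper names swap_pivot_prefix and pivot_by_rank are hypothetical, and the list of rank values is an assumption based on the "8 rows per emp_id" in the question. Passing the values to pivot() explicitly lets Spark skip the extra job it would otherwise run to collect the distinct ranks, which may matter at this data size.

```python
def swap_pivot_prefix(name, sep="_"):
    """Turn '1_dept_id' into 'dept_id_1'.

    Names without a numeric prefix (such as 'emp_id') are returned unchanged.
    """
    prefix, found, rest = name.partition(sep)
    if found and prefix.isdigit():
        return f"{rest}{sep}{prefix}"
    return name


def pivot_by_rank(df, ranks):
    """Pivot df on 'rank' and rename the columns to the <column>_<rank> form.

    `ranks` is the explicit list of pivot values (an assumption: the caller
    knows them up front). The values should have the same type as the
    'rank' column.
    """
    # Imported here so the renaming helper above can be used without Spark.
    import pyspark.sql.functions as F

    pivoted = (
        df.groupBy("emp_id")
        .pivot("rank", ranks)
        .agg(
            F.first("dept_id").alias("dept_id"),
            F.first("dept_name").alias("dept_name"),
        )
    )
    # toDF replaces all column names positionally.
    return pivoted.toDF(*[swap_pivot_prefix(c) for c in pivoted.columns])
```

With the sample input, pivot_by_rank(df, [1, 2]) should give the columns emp_id, dept_id_1, dept_name_1, dept_id_2, dept_name_2. For the full dataset the list would presumably be list(range(1, 9)).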
