Rdd with tuples of different size to dataframe

I created an RDD using a PySpark map-reduce approach, and I now want to create a DataFrame from it. The RDD looks like this:

(491023, ((9,), (0.07971896408231094,), 'Debt collection'))
(491023, ((2, 14, 77, 22, 6, 3, 39, 7, 0, 1, 35, 84, 10, 8, 32, 13), (0.017180308460902963, 0.02751921818456658, 0.011887861159888378, 0.00859908577494079, 0.007521091815230704, 0.006522044953782423, 0.01032297079810829, 0.018976833302472455, 0.007634289723749076, 0.003033975857850723, 0.018805184361326378, 0.011217892399539534, 0.05106916198426676, 0.007901136066759178, 0.008895262042995653, 0.006665649645210911), 'Debt collection'))
(491023, ((36, 12, 50, 40, 5, 23, 58, 76, 11, 7, 65, 0, 1, 66, 16, 99, 98, 45, 13), (0.007528732561416072, 0.017248902490279026, 0.008083896178333739, 0.008274896865005982, 0.0210032206108319, 0.02048387345320946, 0.010225319903418824, 0.017842961406992965, 0.012026753813481164, 0.005154201637708568, 0.008274127579967948, 0.0168843021403551, 0.007416385430301767, 0.009257236955148311, 0.00590385362565239, 0.011031745337733267, 0.011076277004617665, 0.01575522984526745, 0.005431270081282964), 'Vehicle loan or lease'))

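As you can see, the DataFrame needs four columns. The first should be the integer 491023, the second a tuple (I don't think DataFrames have a tuple type, so an array works too), the third another tuple, and the fourth a string. The tuples have different sizes. The simplest command, rdd.toDF(), doesn't work for me. Any ideas how I can achieve this?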

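You can create the DataFrame as shown below. For the variable-length fields you can pass a Python list, which Spark stores as an array column (ArrayType()).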

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_a = spark.createDataFrame([('N110WA', ['12', '34'], 1590038340000)], ["reg", "val1", "val2"])
df_a.show()

Output

+------+--------+-------------+
|   reg|    val1|         val2|
+------+--------+-------------+
|N110WA|[12, 34]|1590038340000|
+------+--------+-------------+

Schema

df_a.printSchema()
root
 |-- reg: string (nullable = true)
 |-- val1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- val2: long (nullable = true)
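
Applied to the RDD in the question, the same idea could look roughly like the sketch below: flatten each record and convert the two inner tuples to lists, so that Spark treats them as arrays whose length may vary per row. The column names (id, indices, weights, label) and the helper names flatten_record and to_dataframe are made up for illustration; the question does not name them. The sketch also assumes a SparkSession already exists, since rdd.toDF() is only available once one has been created.

```python
# Hypothetical column names; the question does not say what the columns are called.
SCHEMA = "id bigint, indices array<int>, weights array<double>, label string"


def flatten_record(record):
    """Turn (id, (indices, weights, label)) into a flat four-field row."""
    record_id, (indices, weights, label) = record
    # Lists map to array columns, so the length can differ from row to row.
    return (record_id, list(indices), list(weights), label)


def to_dataframe(rdd):
    """Assumes `rdd` is a PySpark RDD shaped like the one in the question
    and that a SparkSession has already been created."""
    return rdd.map(flatten_record).toDF(SCHEMA)


# Plain-Python check of the flattening step on shortened sample records.
sample = [
    (491023, ((9,), (0.07971896408231094,), 'Debt collection')),
    (491023, ((2, 14, 77), (0.017180308460902963, 0.02751921818456658,
                            0.011887861159888378), 'Debt collection')),
]
rows = [flatten_record(r) for r in sample]
```

Passing an explicit schema string avoids relying on type inference, which is where tuples of different sizes tend to cause trouble, because a Python tuple is inferred as a struct with a fixed number of fields.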