Generate a PySpark DataFrame using a list comprehension

Question

I'm trying to generate a DataFrame from the following list of column names:

cols = [
    'name',
    'age',
    'team',
    'column1',
    'column2',
    'column3',
    'column4',
    'column5',
    'column6',
]

from pyspark.sql import Row

rows = [Row(**{k: '1776-07-04'}) for k in cols]
df = spark.createDataFrame(rows)

If I run df.columns, the columns from the list above are returned as expected. But when I run df.show(), I get the following error:

Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 9 fields are required while 1 values are provided.

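So I (sort of) understand why I'm getting this error, but I was under the impression that 1776-07-04 would simply be assigned to any/all of the values. What am I missing here?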

Answer 1

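This happens because you are creating a DataFrame with 9 rows, where each row has data for only one of the columns. The list comprehension builds one Row per column name, and each of those Rows has a single field. The inferred schema combines the field names from all rows, which is why df.columns lists all 9 columns, but every individual row supplies only 1 value, so df.show() fails.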
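The difference is visible without Spark at all. Here is a minimal sketch that uses plain dicts in place of Row (an assumption made only so the example runs anywhere) and compares the two comprehension shapes:

```python
cols = ['name', 'age', 'team', 'column1', 'column2',
        'column3', 'column4', 'column5', 'column6']

# Shape used in the question: a list comprehension wrapped around a
# one-key dict. This yields one dict PER column name.
one_key_per_row = [{k: '1776-07-04'} for k in cols]
print(len(one_key_per_row))   # 9 "rows"
print(one_key_per_row[0])     # {'name': '1776-07-04'}

# Dict comprehension: ONE dict that holds every column name.
single_row = {k: '1776-07-04' for k in cols}
print(len(single_row))        # 9 keys in a single "row"
```

The first form corresponds to the 9 one-field rows that trigger the error; the second corresponds to a single row with 9 fields.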

The correct way to create a single row that assigns the value '1776-07-04' to all 9 columns is:

>>> df = spark.createDataFrame([Row(**{k:'1776-07-04' for k in cols})], cols)
>>> df.show()
+----------+----------+----------+----------+----------+----------+----------+----------+----------+
|      name|       age|      team|   column1|   column2|   column3|   column4|   column5|   column6|
+----------+----------+----------+----------+----------+----------+----------+----------+----------+
|1776-07-04|1776-07-04|1776-07-04|1776-07-04|1776-07-04|1776-07-04|1776-07-04|1776-07-04|1776-07-04|
+----------+----------+----------+----------+----------+----------+----------+----------+----------+
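As a possible alternative, the same one-row DataFrame can probably be built from a plain tuple instead of a Row. This is only a sketch: `build_dataframe` is a hypothetical helper name, and it assumes `spark` is an already running SparkSession passed in by the caller.

```python
cols = ['name', 'age', 'team', 'column1', 'column2',
        'column3', 'column4', 'column5', 'column6']

# One tuple with one value per column: a single, complete row.
row = tuple('1776-07-04' for _ in cols)

def build_dataframe(spark):
    """Hypothetical helper: `spark` is assumed to be an existing SparkSession."""
    # A list of column names as the schema lets Spark infer the types
    # (strings here) from the data.
    return spark.createDataFrame([row], schema=cols)
```

Calling build_dataframe(spark).show() in a Spark session should print the same single-row table as above.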
