How do I order fields of my Row objects in Spark (Python)

Question

I'm creating Row objects in Spark. I do not want my fields to be ordered alphabetically. However, if I do the following, they are ordered alphabetically.

row = Row(foo=1, bar=2)

It then creates an object like this:

Row(bar=2, foo=1)

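Then, when I create a dataframe from this object, the column order is bar first and foo second, whereas I'd prefer it the other way around.

I know I can use "_1" and "_2" (for "foo" and "bar", respectively) and then assign a schema (with the appropriate "foo" and "bar" names). But is there any way to prevent the Row object from ordering them?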

Answer 1

Spark >= 3.0

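Field sorting was removed by SPARK-29748 (Remove sorting of fields in PySpark SQL Row creation), except in legacy mode, which is enabled when the following environment variable is set: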

PYSPARK_ROW_FIELD_SORTING_ENABLED=true
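
If the variable has to be set from Python rather than from the shell, a minimal sketch could look like the following. It assumes that PySpark 3.0 reads the variable when the pyspark module is first imported, so the assignment has to happen before that import; the name legacy_sorting is only illustrative.

```python
import os

# Assumption: the variable is read once, when pyspark is first imported,
# so it must be set before any "import pyspark" statement runs.
os.environ["PYSPARK_ROW_FIELD_SORTING_ENABLED"] = "true"

# Illustrative check that the current (driver) process sees the value.
legacy_sorting = os.environ["PYSPARK_ROW_FIELD_SORTING_ENABLED"]
print(legacy_sorting)
```

This only affects the driver process. The executors would presumably need the same value, for example through a spark.executorEnv.PYSPARK_ROW_FIELD_SORTING_ENABLED configuration entry, since inconsistent settings between driver and executors can lead to mismatched field order.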

Spark < 3.0

But is there any way to prevent the Row object from ordering them?

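No. If you provide kwargs, the arguments will be sorted by name. Sorting is required for deterministic behavior, because Python before 3.6 doesn't preserve the order of keyword arguments.

Just use plain tuples: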
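
The effect of sorting keyword arguments can be illustrated in plain Python. The sketch below is not PySpark's implementation; make_fields is a hypothetical helper that either sorts the keyword names (as Spark < 3.0 is described as doing) or keeps the call-site order (which Python 3.6+ guarantees for keyword arguments).

```python
def make_fields(sort_fields, **kwargs):
    """Return (names, values) for a Row-like record.

    Hypothetical helper for illustration only; this is not PySpark code.
    """
    # Sorting gives the same result no matter how the interpreter orders kwargs.
    names = sorted(kwargs) if sort_fields else list(kwargs)
    values = tuple(kwargs[name] for name in names)
    return tuple(names), values


legacy = make_fields(True, foo=1, bar=2)   # names sorted alphabetically
modern = make_fields(False, foo=1, bar=2)  # call-site order (Python 3.6+)

print(legacy)  # (('bar', 'foo'), (2, 1))
print(modern)  # (('foo', 'bar'), (1, 2))
```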

rdd = sc.parallelize([(1, 2)])

and pass the schema as an argument to RDD.toDF (not to be confused with DataFrame.toDF):

rdd.toDF(["foo", "bar"])

or createDataFrame:

from pyspark.sql.types import IntegerType, StructField, StructType

spark.createDataFrame(rdd, ["foo", "bar"])

# With full schema
schema = StructType([
    StructField("foo", IntegerType(), False),
    StructField("bar", IntegerType(), False)])

spark.createDataFrame(rdd, schema)

You can also use namedtuples:

from collections import namedtuple

FooBar = namedtuple("FooBar", ["foo", "bar"])
spark.createDataFrame([FooBar(foo=1, bar=2)])
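
The reason a namedtuple keeps its column order can be checked without Spark: the field order is fixed when the class is defined, not by the order of keyword arguments at the call site. A small sketch, reusing the FooBar name from above (the record variable is only illustrative):

```python
from collections import namedtuple

FooBar = namedtuple("FooBar", ["foo", "bar"])

# Keyword arguments are deliberately passed in the "wrong" order.
record = FooBar(bar=2, foo=1)

print(FooBar._fields)  # ('foo', 'bar')
print(tuple(record))   # (1, 2)
```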

Finally, you can reorder the columns with select:

sc.parallelize([Row(foo=1, bar=2)]).toDF().select("foo", "bar")

Answer 2

From the documentation:

Row also can be used to create another Row like class, then it could be used to create Row objects

In this case the order of the columns is preserved:

>>> FooRow = Row('foo', 'bar')
>>> row = FooRow(1, 2)
>>> spark.createDataFrame([row]).dtypes
[('foo', 'bigint'), ('bar', 'bigint')]

Answer 3

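How to sort the original schema so that it matches the alphabetical field order of the RDD:

from pyspark.sql.types import StructType

# Rebuild the schema with its fields sorted alphabetically by name
schema_sorted = StructType()
structfield_list_sorted = sorted(df.schema, key=lambda x: x.name)
for item in structfield_list_sorted:
    schema_sorted.add(item)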
