How can I reference a column with a hyphen in its name in a pyspark column expression?

Question

I have a JSON document shaped like this (note that I do not control this schema, so I cannot remove the hyphen from the key):

{
   "col1": "value1",
   "dictionary-a": {
      "col2": "value2"
   }
}

I read this JSON into a dataframe (named 'df') using session.read.json(...), like so:

df = session.read.json('/path/to/json.json')

I want to do this:

df2 = df.withColumn("col2", df.dictionary-a.col2)

I get the error:

AttributeError: 'DataFrame' object has no attribute 'dictionary'

How can I reference a column with a hyphen in its name in a pyspark column expression?

Answer 1

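As you can see, the hyphen in df.dictionary-a.col2 is evaluated as subtraction: df.dictionary - a.col2. Python evaluates the left operand, df.dictionary, first, and that attribute lookup is what raises the AttributeError.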
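To confirm how Python reads that expression, here is a minimal sketch using the standard-library ast module (ast.unparse needs Python 3.9 or later). It only parses the text of the expression, so no Spark session is required:

```python
import ast

# Parse the expression from the question without evaluating it.
node = ast.parse("df.dictionary-a.col2", mode="eval").body

node_type = type(node).__name__    # kind of the top-level expression
op_type = type(node.op).__name__   # the operator Python found
left_src = ast.unparse(node.left)  # source text of the left operand
right_src = ast.unparse(node.right)  # source text of the right operand

print(node_type, op_type)   # BinOp Sub
print(left_src)             # df.dictionary
print(right_src)            # a.col2
```

The top-level node is a binary subtraction, so the hyphen never becomes part of an attribute name.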

Instead, you can use pyspark.sql.functions.col to refer to the column by name and pyspark.sql.Column.getItem to access an element of the dictionary by key.

Try:

from pyspark.sql.functions import col
df2 = df.withColumn("col2", col("dictionary-a").getItem("col2"))
