How to pass columns as comma-separated parameters in PySpark
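I have a DataFrame with thousands of columns that I want to pass to the greatest function without naming each column individually. How can I do this?
For example, I have df with 3 columns, which I pass to greatest by specifying each one as df.x, df.y, and so on.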
>>> from pyspark.sql.functions import greatest
>>> df = sqlContext.createDataFrame([(1, 4, 3)], ['x', 'y', 'z'])
>>> df.select(greatest(df.x, df.y, df.z).alias('greatest')).show()
+--------+
|greatest|
+--------+
|       4|
+--------+
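In the example above I have only 3 columns, but with thousands of them it is impossible to list each one. The few things I tried did not work. I am missing some key piece of Python...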
df.select(greatest(",".join(df.columns)).alias('greatest')).show()
ValueError: greatest should take at least two columns
df.select(greatest(",".join(df.columns),df[0]).alias('greatest')).show()
u"cannot resolve 'x,y,z' given input columns: [x, y, z];"
df.select(greatest([c for c in df.columns],df[0]).alias('greatest')).show()
Method col([class java.util.ArrayList]) does not exist
greatest supports positional arguments*:
pyspark.sql.functions.greatest(*cols)
(this is why you can call greatest(df.x, df.y, df.z)), so just unpack the list of column names:
df = sqlContext.createDataFrame([(1, 4, 3)], ['x', 'y', 'z'])
df.select(greatest(*df.columns))
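To see why the three attempts in the question fail while unpacking works, here is a minimal plain-Python sketch that needs no Spark session. count_args is a hypothetical stand-in that only shares the *cols signature of greatest, the columns list plays the role of df.columns, and columns[0] stands in for df[0]; none of these names come from PySpark.

```python
def count_args(*cols):
    """Hypothetical stand-in with the same *cols signature as greatest.

    It returns the arguments it received so the call shape can be inspected.
    """
    return cols


columns = ['x', 'y', 'z']  # plays the role of df.columns

# Attempt 1: a single string argument, hence "at least two columns".
joined = count_args(",".join(columns))

# Attempt 2: two arguments, but the first is the one name 'x,y,z',
# which matches no column.
joined_plus = count_args(",".join(columns), columns[0])

# Attempt 3: two arguments, but the first is a list, not a column.
nested = count_args([c for c in columns], columns[0])

# Unpacking: three separate arguments, one per column name.
unpacked = count_args(*columns)

print(len(joined), len(joined_plus), len(nested), len(unpacked))
```

Only the unpacked call hands the function one argument per column, which is the shape greatest(df.x, df.y, df.z) has.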
* Quoting the Python glossary, a positional argument is:
... an argument that is not a keyword argument. Positional arguments can appear at the beginning of an argument list and/or be passed as elements of an iterable preceded by *. For example, 3 and 5 are both positional arguments in the following calls:
complex(3, 5)
complex(*(3, 5))
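The glossary definition can be checked directly in plain Python. The sketch below uses the built-in max purely as an analogue of a variadic function; it is not a Spark function, and row is an illustrative tuple holding the x, y, z values from the example.

```python
row = (1, 4, 3)  # the x, y, z values from the example row

# Both spellings from the glossary pass the same positional arguments.
same_complex = complex(3, 5) == complex(*(3, 5))

# Unpacking works for any iterable: a tuple, a list, or a generator.
largest_from_tuple = max(*row)
largest_from_list = max(*list(row))
largest_from_gen = max(*(v for v in row))

print(same_complex, largest_from_tuple, largest_from_list, largest_from_gen)
```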
See also:
- *args and **kwargs?
- What does ** (double star/asterisk) and * (star/asterisk) do for parameters?