PySpark: TypeError: 'Column' object is not callable

Question

I am loading data from HDFS and want to filter it by specific variables. But somehow the Column.isin command does not work. It throws this error:

TypeError: 'Column' object is not callable

from pyspark.sql.functions import udf, col
variables = ('852-PI-769', '812-HC-037', '852-PC-571-OUT')
df = sqlContext.read.option("mergeSchema", "true").parquet("parameters.parquet")
same_var = col("Variable").isin(variables)
df2 = df.filter(same_var)

The schema looks like this:

df.printSchema()
root
 |-- Time: timestamp (nullable = true)
 |-- Value: float (nullable = true)
 |-- Variable: string (nullable = true)

Any idea what I am doing wrong? PS: this is Spark 1.4 with Jupyter Notebook.

Answer 1

Try the following code:

df.filter(df.Variable.isin(['852-PI-769', '812-HC-037', '852-PC-571-OUT']))

Answer 2

The problem is that isin was added to Spark in version 1.5.0, so it is not yet available in your version of Spark, as the documentation for isin shows.
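Why the message says "not callable" rather than "no attribute": as far as I can tell, the 1.4 Column class treats any unknown attribute name as a nested-field access and returns another Column, so col("Variable").isin silently yields a Column and the failure only shows up at the call. A minimal sketch of that mechanism, using a toy stand-in class (ToyColumn is hypothetical and is not the real pyspark.sql.Column):

```python
# Toy stand-in, NOT the real pyspark.sql.Column: it only mimics how
# lookup of an unknown attribute name falls back to field access.
class ToyColumn(object):
    def __init__(self, name):
        self.name = name

    def inSet(self, *values):
        # A method that really exists on the class is found normally.
        return ToyColumn("%s IN %r" % (self.name, values))

    def __getattr__(self, item):
        # Called only when normal lookup fails; returns a nested-field column.
        if item.startswith("__"):
            raise AttributeError(item)
        return ToyColumn("%s.%s" % (self.name, item))


variable = ToyColumn("Variable")
attr = variable.isin  # no such method here, so this is another ToyColumn

try:
    attr(("852-PI-769", "812-HC-037"))
    message = None
except TypeError as exc:
    # Calling an instance without __call__ raises the same kind of error
    message = str(exc)

print(attr.name)
print(message)
```

Under that assumption, any misspelled or not-yet-existing method name on a Column would produce the same TypeError, which is why the error text does not point at isin directly.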

The Scala API has a similar function, in, which was introduced in 1.3.0 and offers similar functionality (the input differs somewhat, since in only accepts columns). In PySpark this function is called inSet. Usage examples from the documentation:

df[df.name.inSet("Bob", "Mike")]
df[df.age.inSet([1, 2, 3])]

Note: inSet is deprecated in version 1.5.0 and later; isin should be used in newer versions.
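If the same notebook has to run on both older and newer clusters, one option is to pick the method name from the Spark version string. The sketch below is only an illustration: membership_method and is_in are hypothetical helper names, not part of PySpark, and the 1.5 cutoff is taken from the version note above. A hasattr check is avoided on purpose, because on 1.4 an unknown attribute on a Column appears to resolve to another Column instead of failing.

```python
def membership_method(spark_version):
    """Return the Column method name to use for a Spark version string."""
    # Only major.minor matter, e.g. "1.4.1" -> (1, 4)
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return "isin" if (major, minor) >= (1, 5) else "inSet"


def is_in(column, values, spark_version):
    # `column` is any object exposing the chosen method; with PySpark it
    # would be something like df.Variable (assumption, not tested here).
    method = getattr(column, membership_method(spark_version))
    # Both inSet and isin accept the candidate values as varargs.
    return method(*values)
```

In the question's setting this might be used as df.filter(is_in(df.Variable, variables, sc.version)), assuming sc is the SparkContext that the notebook already provides.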

Tags: python, apache-spark, pyspark, spark-dataframe