pyspark过滤行错误

Question

我正在尝试在 PySpark 中删除按 CustomerID 分组时计数小于 10 的客户行。因此，我首先获取计数 < 10 的客户的 CustomerID。然后我通过使用不在删除列表中的具有 CustomerID 的客户来过滤它。但是我得到 Py4JJavaError error。谁能告诉我如何正确执行此操作？

rm_user_1 = cleaned_df.groupBy('CustomerID').count().withColumnRenamed("count", "n").filter("n < 10").select('CustomerID').collect()

cleaned_df = cleaned_df.filter(~cleaned_df.CustomerID.isin(rm_user_1))

Answer 1

rm_user_1 = cleaned_df.groupBy('CustomerID').count().withColumnRenamed("count", "n").filter("n < 10").select('CustomerID').collect()

变量rm_user_1是Row类型。您需要访问行内的 CustomerID 值。列表理解就足够了：

rm_users = [x.CustomerID for x in rm_user_1]
cleaned_df = cleaned_df.filter(~cleaned_df.CustomerID.isin(rm_users))

pyspark过滤行错误

pyspark filtering rows error

python

apache-spark

pyspark

pyspark-sql