Excluding "None" from output of map function
I have this code:
fileRDD.map(positive)\
.map(lambda x: [x,1])\
.reduceByKey(lambda x,y: x+y)\
.take(10)
The output is:
[(None, 3194395),
(0, 240597),
(1, 224805),
(2, 210585),
(3, 198246),
(4, 202869),
(5, 92615),
(6, 60493)]
How do I remove the None row from the output? (I only need the results for 0 through 6.)
Use the filter function on the RDD:
rdd = spark.sparkContext.parallelize([
(None, 3194395), (0, 240597), (1, 224805),
(2, 210585), (3, 198246), (4, 202869),
(5, 92615), (6, 60493)
])
rdd1 = rdd.filter(lambda x: x[0] is not None)
print(rdd1.collect())
#[(0, 240597), (1, 224805), (2, 210585), (3, 198246), (4, 202869), (5, 92615), (6, 60493)]
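A variant worth considering (my suggestion, not part of the answer above): apply the filter *before* reduceByKey, e.g. `fileRDD.map(positive).filter(lambda x: x is not None).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)`, so the ~3.2M None records never enter the shuffle. The sketch below simulates that idea locally without Spark; the `labels` list is a made-up stand-in for the output of `fileRDD.map(positive)`.

```python
from collections import Counter

# Stand-in for fileRDD.map(positive): some records map to None.
labels = [None, 0, 1, None, 2, 0, None]

# Drop None before counting -- mirrors .filter(lambda x: x is not None)
# placed ahead of the (x, 1) mapping and reduceByKey.
kept = [x for x in labels if x is not None]

# Counter plays the role of reduceByKey(lambda a, b: a + b) here.
counts = sorted(Counter(kept).items())
print(counts)  # [(0, 2), (1, 1), (2, 1)]
```

Filtering after reduceByKey (as in the answer) also works, but it aggregates the None key first and discards it afterwards; filtering early avoids that work.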