Apache Spark union method giving inexplicable result

Question

I'm playing with the Moby Words list using Apache Spark; here is the file. I first created an RDD from this text file

    lines = sc.textFile("words.txt")

then created two RDDs containing the words that have "p" and "s" in them, respectively

    plines = lines.filter(lambda x: "p" in x)
    slines = lines.filter(lambda x: "s" in x)

and then created the union of the two

    union_list = slines.union(plines)

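Then I counted the number of words in each list using the "count" method and got 64803, 22969, and 87772 for slines, plines, and union_list, respectively. Since 64803 + 22969 = 87772, that would mean there are no words containing both "p" and "s". I created a new RDD of the words containing both "p" and "s" using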

    pslines = lines.filter(lambda x: ("p" in x) and ("s" in x))

and counted its elements, which gave 13616. Then I created a new RDD of the words containing "p" or "s"

    newlist = lines.filter(lambda x: ("p" in x) or ("s" in x))

and counted its elements, which gave 74156. That makes sense, because 64803 + 22969 - 13616 = 74156. What am I doing wrong with the union method? I'm using Spark 1.6 on Windows 10 with Python 3.5.1.

Answer 1

The union() method is not a set union operation. It just concatenates the two RDDs, so any element in their intersection is counted twice. If you want a true set union, you need to run distinct() on the resulting RDD:

    union_list = slines.union(plines).distinct()
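To see why the counts come out the way they do, here is a rough plain-Python analogy that needs no Spark installation. Ordinary lists stand in for the RDDs, list concatenation stands in for union(), and set() stands in for distinct(). The tiny word list is made up for illustration and is not taken from words.txt; the three counts at the end are the ones reported in the question.

```python
# Hypothetical miniature word list standing in for words.txt (not the real data).
words = ["spark", "python", "stream", "pig", "hadoop", "java", "scala"]

p_words = [w for w in words if "p" in w]
s_words = [w for w in words if "s" in w]
both = [w for w in words if "p" in w and "s" in w]

# Analogue of slines.union(plines): plain concatenation, duplicates kept.
concatenated = s_words + p_words

# Analogue of .distinct(): each element is kept once.
deduplicated = set(concatenated)

print(len(concatenated))  # size of s_words plus size of p_words
print(len(deduplicated))  # the same total minus the words counted twice

# The same bookkeeping with the counts reported in the question.
s_count, p_count, both_count = 64803, 22969, 13616
union_count = s_count + p_count              # what union().count() reported
set_union_count = union_count - both_count   # what the "or" filter reported
```

Note that distinct() removes every duplicate, not only the ones introduced by the union, so if the input file itself contained repeated lines the result could be smaller than the count from the single "or" filter. That single filter from the question is also a reasonable alternative when both RDDs come from the same source, since it never creates the duplicates in the first place.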


python-3.x

apache-spark

pyspark