Scala: Which is better - distinct then union or union then distinct?

Question

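If I have two lists:

listA - 1 million strings,
listB - 1 million strings

and I want to merge them into a third list, listC, that contains only the unique values from the two lists above, which of the following approaches is better:

Applying distinct() to listA and listB before merging them into listC, or
Building the unioned listC and then applying distinct to listC

Does the same logic also apply to arrays?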

Answer 1

Let's look at the distinct implementation:

def distinct: Repr = {
  val b = newBuilder
  val seen = mutable.HashSet[A]()
  for (x <- this) {
    if (!seen(x)) {
      b += x
      seen += x
    }
  }
  b.result()
}

It uses a mutable structure for performance reasons.

So, if performance is a concern, you can implement a distinct union the same way:

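import scala.collection.mutable

def distinctUnion[E](listA: Seq[E], listB: Seq[E]): Seq[E] = {
  val b = Seq.newBuilder[E]
  // One set shared by both loops, so duplicates across the two lists are dropped too
  val seen = mutable.HashSet[E]()
  // Keep the first occurrence of each element of listA
  for (x <- listA) {
    if (!seen(x)) {
      b += x
      seen += x
    }
  }
  // Then append only the elements of listB that have not been seen yet
  for (x <- listB) {
    if (!seen(x)) {
      b += x
      seen += x
    }
  }
  b.result()
}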
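
As a rough sketch of the same idea in a more compact form, the two loops can be merged by iterating over the concatenated iterators. The names DistinctUnionSketch and distinctUnionCompact are made up for this illustration, and this variant has not been benchmarked against the version above:

```scala
import scala.collection.mutable

object DistinctUnionSketch {
  // Hypothetical helper: same approach as distinctUnion, with the two loops merged.
  def distinctUnionCompact[E](listA: Seq[E], listB: Seq[E]): Seq[E] = {
    val b = Seq.newBuilder[E]
    val seen = mutable.HashSet[E]()
    // Concatenating iterators is lazy, so no intermediate combined Seq is built.
    for (x <- listA.iterator ++ listB.iterator) {
      // add returns true only when x was not already in the set
      if (seen.add(x)) b += x
    }
    b.result()
  }
}
```

For example, DistinctUnionSketch.distinctUnionCompact(Seq("a", "b", "a"), Seq("b", "c")) should keep the first occurrence of each value in encounter order.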

Answer 2

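You need to do the union first, but you want to do it lazily, to avoid creating the unioned collection in memory. Something like this should work: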

val listC = (listA.view union listB).distinct.toList

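Computationally, this will probably end up doing something very similar to what Jean posted, but it is a bit nicer because it makes better use of the Scala collections library.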
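
To illustrate why the union has to come first, here is a small sketch with made-up sample data (the object and value names are only for this example). It uses strict ++ concatenation rather than a view to keep things short, so it does build the intermediate collection. Applying distinct to each list separately still leaves the values that occur in both lists, so a final distinct pass is needed either way. The last line suggests that the same expression works on arrays:

```scala
object UnionDistinctComparison {
  // Small sample data standing in for the two large lists
  val listA = List("a", "b", "a", "c")
  val listB = List("c", "d", "b", "d")

  // Union first, then a single distinct pass
  val unionThenDistinct = (listA ++ listB).distinct

  // Distinct first: "b" and "c" occur in both lists, so they appear twice here
  val distinctThenUnion = listA.distinct ++ listB.distinct

  // A final distinct is still required to get the unique values
  val distinctTwice = distinctThenUnion.distinct

  // Arrays support the same operations
  val arrayC = (listA.toArray ++ listB.toArray).distinct
}
```

Since the per-list distinct calls do not remove the need for the final one, they amount to extra passes unless the individual lists contain a very large share of duplicates.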
