How to get the top-k frequent words in Spark without sorting?
In Spark, we can easily count word occurrences in map-reduce style and then sort to get the top-k most frequent words:
// Sort locally inside node, keep only top-k results,
// no network communication
val partialTopK = wordCount.mapPartitions(it => {
val a = it.toArray
a.sortBy(-_._2).take(10).iterator
}, true)
// Collect local top-k results, faster than the naive solution
val collectedTopK = partialTopK.collect
collectedTopK.size
// Compute global top-k at master,
// no communication, everything done on the master node
val topK = collectedTopK.sortBy(-_._2).take(10)
But I am wondering whether there is a better solution that avoids sorting entirely?
I think you want takeOrdered:
Returns the first k (smallest) elements from this RDD as defined by
the specified implicit Ordering[T] and maintains the ordering.
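As a minimal sketch (assuming wordCount is an RDD[(String, Int)] of (word, count) pairs as in the question; the helper name topKByCount is illustrative, not part of Spark's API), takeOrdered with a descending ordering on the count returns the k most frequent words:
import org.apache.spark.rdd.RDD

def topKByCount(wordCount: RDD[(String, Int)], k: Int): Array[(String, Int)] = {
  // Order descending by count so the "first k" elements under this
  // Ordering are the most frequent words.
  val byCountDesc: Ordering[(String, Int)] = Ordering.by[(String, Int), Int](_._2).reverse
  wordCount.takeOrdered(k)(byCountDesc)
}
Each partition only keeps a bounded set of k candidates, which are then merged on the driver, so the full RDD is never globally sorted.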
Or top:
Returns the top k (largest) elements from this RDD as defined by the
specified implicit Ordering[T].
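Equivalently, top is the same operation with the ordering reversed; a sketch under the same assumptions (the name topKWithTop is illustrative):
import org.apache.spark.rdd.RDD

def topKWithTop(wordCount: RDD[(String, Int)], k: Int): Array[(String, Int)] = {
  // `top` returns the k largest elements under the given Ordering,
  // so ordering by the count directly yields the k most frequent words.
  wordCount.top(k)(Ordering.by[(String, Int), Int](_._2))
}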
There are also several other SO questions/answers that appear to be at least partial duplicates of this one.