
Spark: counting distinct values by key

I am new to Spark and know the following commands. They give a count of values per key and a list of values per key, respectively.

dayToHostPairTuple.countByKey()   # count of all values per key (duplicates included)
dayToHostPairTuple.groupByKey()   # list of values per key

Is there a simple alternative to countByKey that counts only the distinct values per key?


The code below works for me. It is based on the answers I received.

dayToHostPairTuple = access_logs.map(lambda log: (log.date_time.day, log.host))
dayToHostPairTuple = dayToHostPairTuple.sortByKey()
print(dayToHostPairTuple.distinct().countByKey())
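(Note that the sortByKey step is not needed for correctness here; distinct().countByKey() returns the same counts without it.)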

Assuming the values are hashable, you can use distinct followed by countByKey:

dayToHostPairTuple.distinct().countByKey()
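
To see what this does end to end, here is a minimal, self-contained sketch; the (day, host) sample pairs below are made up for illustration and stand in for the real access_logs data:

from pyspark import SparkContext

sc = SparkContext("local", "distinct-count-demo")

# Hypothetical (day, host) pairs: day 1 has two distinct hosts, day 2 has one.
pairs = sc.parallelize([(1, "a.com"), (1, "a.com"), (1, "b.com"), (2, "a.com")])

# distinct() removes the duplicate (1, "a.com") pair before counting per key.
print(pairs.distinct().countByKey())
# defaultdict(<class 'int'>, {1: 2, 2: 1})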

or, equivalently, with reduceByKey:

from operator import add

dayToHostPairTuple.distinct().keys().map(lambda x: (x, 1)).reduceByKey(add)
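
Unlike countByKey, which returns a plain Python dict on the driver, the reduceByKey version yields an RDD, so the result stays distributed and scales to a large number of keys. Continuing the toy pairs example from above:

counts = pairs.distinct().keys().map(lambda x: (x, 1)).reduceByKey(add)
print(counts.collect())
# [(1, 2), (2, 1)]  (order may vary)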

I would suggest:

dayToHostPairTuple.countApproxDistinctByKey(0.005)

From the help:

Return approximate number of distinct values for each key in this RDD. The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available here. relativeSD - Relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017
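
As a quick sketch, applied to the same toy pairs RDD from the example above (with so little data the estimate matches the exact counts; on large data it trades a small, tunable error for far less memory than exact counting):

approx = pairs.countApproxDistinctByKey(relativeSD=0.005)
print(approx.collect())
# [(1, 2), (2, 1)]  (approximate; order may vary)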