spark counting distinct values by key
I'm new to Spark and know the following commands. They give the count of values by key and the list of values by key, respectively.
dayToHostPairTuple.countByKey()
dayToHostPairTuple.groupByKey()
Is there any simple alternative to countByKey that counts only the distinct values per key?
#########################################==
The code below works for me. It is based on the answer I received.
dayToHostPairTuple = access_logs.map(lambda log: (log.date_time.day, log.host))
dayToHostPairTuple=dayToHostPairTuple.sortByKey()
print(dayToHostPairTuple.distinct().countByKey())
Assuming the values are hashable, you can use distinct and countByKey:
dayToHostPairTuple.distinct().countByKey()
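For intuition, here is a plain-Python sketch of what distinct().countByKey() computes, using made-up (day, host) pairs in place of the RDD's contents:

```python
from collections import Counter

# Hypothetical (day, host) pairs standing in for the RDD's contents.
pairs = [(1, "a"), (1, "a"), (1, "b"), (2, "a")]

# distinct() drops duplicate (key, value) pairs;
# countByKey() then counts the remaining pairs per key.
distinct_pairs = set(pairs)
counts = Counter(k for k, _ in distinct_pairs)
# counts: day 1 has two distinct hosts, day 2 has one.
```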
or reduceByKey:
from operator import add
dayToHostPairTuple.distinct().keys().map(lambda x: (x, 1)).reduceByKey(add)
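The same pipeline can be sketched in plain Python (again with made-up pairs): after distinct() removes duplicate (key, value) pairs, each surviving key is mapped to (key, 1) and the 1s are summed per key, mimicking reduceByKey(add):

```python
from operator import add
from functools import reduce
from itertools import groupby

# Hypothetical (day, host) pairs standing in for the RDD's contents.
pairs = [(1, "a"), (1, "a"), (1, "b"), (2, "a")]

# distinct() -> keys() -> map(lambda x: (x, 1)) -> reduceByKey(add)
ones = [(k, 1) for k, _ in set(pairs)]
grouped = groupby(sorted(ones), key=lambda kv: kv[0])
result = {k: reduce(add, (v for _, v in g)) for k, g in grouped}
```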
I would suggest
dayToHostPairTuple.countApproxDistinctByKey(0.005)
From the help:
Return approximate number of distinct values for each key in this RDD.
The algorithm used is based on streamlib's implementation of
"HyperLogLog in Practice: Algorithmic Engineering of a State of The
Art Cardinality Estimation Algorithm", available here.
relativeSD - Relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017