Sorting a variable number of columns/rows

users_grpd = pairs.groupByKey()

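# Tuple-unpacking lambdas (lambda (k, vals): ...) are Python 2 only,
# so index into the (key, values) pair instead.
users_grpd_flattened = users_grpd.map(
    lambda kv: "{0} {1}".format(kv[0], ' '.join(str(x) for x in kv[1])))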

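The first column is the userid and the remaining columns are product IDs. I now want to sort the product IDs for each user. The number of products per user is not fixed; it varies from user to user. Is there an efficient way to sort the product IDs per user? This is what users_grpd_flattened looks like: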

userid   product ids.............

30095212 208518 10519 208520 120821
3072220 20506 205037
209212 208518 10519 208520 120821
100222 20506 205037 10519 208520 120821 20116  124574 102575

You can use mapValues with sorted:

users_grpd.mapValues(sorted)

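When you use mapValues, the input partitioning is preserved, so no shuffle is involved. The most expensive and potentially dangerous operation is the groupByKey that comes before it.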
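To make the per-key behaviour concrete, here is a plain-Python sketch of what groupByKey().mapValues(sorted) computes. It is only a local model that runs without Spark, and the name group_and_sort_values is made up for this illustration:

```python
from collections import defaultdict


def group_and_sort_values(pairs):
    """Local, non-distributed model of pairs.groupByKey().mapValues(sorted)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Each key's values are sorted independently of every other key,
    # which is why no data has to move between keys at this step.
    return {key: sorted(values) for key, values in grouped.items()}


sample = [(3072220, 205037), (3072220, 20506), (87620, 12012851), (87620, 1205376)]
result = group_and_sort_values(sample)
print(result[3072220])  # [20506, 205037]
print(result[87620])    # [1205376, 12012851]
```

In Spark the grouping step is the one that moves data across the cluster. The sort then runs locally on each key's already-collected values.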

To check that everything works as expected (is_sorted is taken from @WaiYipTung's answer):

def is_sorted(l):
    return all(l[i] <= l[i+1] for i in range(len(l)-1))

pairs = sc.parallelize([
    (30095212, 208518), (30095212, 10519), (30095212, 208520), 
    (30095212, 120821), (3072220, 20506), (3072220, 205037),
    (209212, 208518), (209212, 10519), (209212, 208520), (209212, 120821),
    (100222, 20506), (100222, 205037), (100222, 10519), (100222, 208520),
    (100222, 120821), (100222, 20116), (100222, 124574), (100222, 102575),
    (87620, 12012851), (87620, 12022661), (87620, 12033827), (87620, 1205376)
])

users_grpd_with_sorted_vals = pairs.groupByKey().mapValues(sorted)

Some checks:

>>> all(users_grpd_with_sorted_vals.values().map(is_sorted).collect())
True
>>> users_grpd_with_sorted_vals.lookup(87620)
[[1205376, 12012851, 12022661, 12033827]]
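If the end goal is the flattened one-line-per-user text from the question, the sort can be folded into the formatting step. The helper below is a sketch, and format_user_row is a hypothetical name. With an RDD of (userid, product IDs) pairs it could presumably be applied as users_grpd.map(lambda kv: format_user_row(kv[0], kv[1])).

```python
def format_user_row(user_id, product_ids):
    """Return 'userid id1 id2 ...' with the product IDs in ascending order."""
    # sorted() accepts any iterable, so it also works on the iterable
    # of values that groupByKey yields for each key.
    ordered = sorted(product_ids)
    return "{0} {1}".format(user_id, " ".join(str(p) for p in ordered))


row = format_user_row(3072220, [205037, 20506])
print(row)  # 3072220 20506 205037
```

This assumes the product IDs are integers, as in the sample data. If they were strings, sorted would order them lexicographically, so "10519" would come before "9".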