Sorting a variable number of columns/rows
users_grpd = pairs.groupByKey()
# kv is a (userid, product ids) pair; indexing it works on Python 2 and 3
users_grpd_flattened = users_grpd.map(
    lambda kv: "{0} {1}".format(kv[0], ' '.join(str(x) for x in kv[1])))
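The first column of users_grpd_flattened is the userid and the remaining columns are product IDs. I now want to sort the product IDs for each user. The number of products per user is not fixed; it varies from user to user. Is there a way to sort the product IDs per user efficiently? This is what users_grpd_flattened looks like: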
userid product ids.............
30095212 208518 10519 208520 120821
3072220 20506 205037
209212 208518 10519 208520 120821
100222 20506 205037 10519 208520 120821 20116 124574 102575
users_grpd.mapValues(sorted)
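When you use mapValues, the input partitioning is preserved, so no shuffle is involved; the most expensive and potentially dangerous operation is the groupByKey that comes before it.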
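To make the key/value behaviour concrete, here is a minimal plain-Python sketch of what the groupByKey and mapValues(sorted) steps do to the data. The helpers group_by_key and map_values are hypothetical local stand-ins, not Spark APIs, and they say nothing about partitioning or performance; they only mirror the result shape.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Local stand-in for RDD.groupByKey(): collect the values seen for each key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def map_values(grouped, func):
    """Local stand-in for RDD.mapValues(): keys are untouched, func is applied to each value."""
    return {key: func(values) for key, values in grouped.items()}

sample_pairs = [(3072220, 205037), (3072220, 20506),
                (87620, 12012851), (87620, 1205376)]
sorted_by_user = map_values(group_by_key(sample_pairs), sorted)
print(sorted_by_user)
# {3072220: [20506, 205037], 87620: [1205376, 12012851]}
```

Because only the values are transformed and the keys never change, each record can stay where it is, which is why the real mapValues does not need to redistribute data.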
To check that everything works (is_sorted is taken from @WaiYipTung's answer):
def is_sorted(l):
    # True when every element is <= the one after it
    return all(l[i] <= l[i+1] for i in range(len(l)-1))
pairs = sc.parallelize([
    (30095212, 208518), (30095212, 10519), (30095212, 208520),
    (30095212, 120821), (3072220, 20506), (3072220, 205037),
    (209212, 208518), (209212, 10519), (209212, 208520), (209212, 120821),
    (100222, 20506), (100222, 205037), (100222, 10519), (100222, 208520),
    (100222, 120821), (100222, 20116), (100222, 124574), (100222, 102575),
    (87620, 12012851), (87620, 12022661), (87620, 12033827), (87620, 1205376)
])
users_grpd_with_sorted_vals = pairs.groupByKey().mapValues(sorted)
Some checks:
>>> all(users_grpd_with_sorted_vals.values().map(is_sorted).collect())
True
>>> users_grpd_with_sorted_vals.lookup(87620)
[[1205376, 12012851, 12022661, 12033827]]
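If the flattened text form from the question is still needed, the sort can be folded into the formatting step. The following is only a sketch: format_user_row is a hypothetical helper name, and the Spark line in the comment is an assumption about how it would be wired in, not something run here.

```python
def format_user_row(user_id, product_ids):
    """Return 'userid id1 id2 ...' with the product IDs in ascending order."""
    return "{0} {1}".format(
        user_id, " ".join(str(p) for p in sorted(product_ids)))

# Assumed usage with the RDD from the question (needs a SparkContext):
# users_grpd_flattened = pairs.groupByKey().map(
#     lambda kv: format_user_row(kv[0], kv[1]))

row = format_user_row(87620, [12012851, 12022661, 12033827, 1205376])
print(row)
# 87620 1205376 12012851 12022661 12033827
```

The IDs are sorted as integers before being converted to strings, so 1205376 comes before 12012851, as in the lookup result above.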