Behavior of map and flatMap with decimal point

Question

Consider this movie-ratings dataset (userId,movieId,rating,timestamp):

1,1,4.0,964982703
1,3,4.0,964981247
1,223,3.0,964980985
1,231,5.0,964981179
1,1226,5.0,964983618
6,95,4.0,845553559
6,100,3.0,845555151
6,102,1.0,845555436
6,104,4.0,845554349
6,105,3.0,845553757
6,110,5.0,845553283
6,112,4.0,845553994
610,152081,4.0,1493846503
610,152372,3.5,1493848841
610,155064,3.5,1493848456
610,156371,5.0,1479542831
610,156726,4.5,1493848444
610,157296,4.0,1493846563
610,158238,5.0,1479545219
610,158721,3.5,1479542491
610,160080,3.0,1493848031
610,160341,2.5,1479545749
610,160527,4.5,1479544998

m = sc.textFile('movies/ratings_s.csv')

For a histogram of the ratings, I know we can do this:

scores = m.map(lambda line : line.split(',')[2])
sorted(scores.countByValue().items())

[('1.0', 1), ('2.5', 1), ('3.0', 4), ('3.5', 3), ('4.0', 7), ('4.5', 2), ('5.0', 5)]

I tried flatMap just to understand the difference:

scores = m.flatMap(lambda line : line.split(',')[2])
sorted(scores.countByValue().items())

The result I got is:

[('.', 23), ('0', 17), ('1', 1), ('2', 1), ('3', 7), ('4', 9), ('5', 11)]

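Could you help explain the behavior of flatMap:

What is the logic that flatMap is doing? What does it "flatten" to create such a result?
Why does it strip "." separately and keep only the integral part? We are not asking to split with ".".
How should I get back the decimal results with a .5 score?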

Answer 1

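What is the logic that flatMap is doing? What does it "flatten" to create such a result?

Answer - line.split(',')[2] returns a string. flatMap flattens whatever the function returns, and flattening a string yields its individual characters, because a string is a sequence of characters. That is why every element in your output is a single character.

Why does it strip "." separately and keep only the integral part? We are not asking to split with "."

Answer - The explanation above covers this: "." is not stripped. It is just one more character of the string, so it is emitted and counted like every digit.

How should I get back the decimal results with a .5 score?

Answer - Again, the explanation above shows what you need to do: use map rather than flatMap so that each rating stays whole. To treat the ratings as numbers, map each string to a number first and then count.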
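The two points above can be illustrated in plain Python, without Spark. This is only a sketch: the sample lines are a small subset of the question's data, and the names chars, ratings, and histogram are made up for illustration. In Spark, the conversion step would be m.map(lambda line: float(line.split(',')[2])).

```python
from collections import Counter

# A few sample lines in the same userId,movieId,rating,timestamp layout.
lines = [
    "1,1,4.0,964982703",
    "610,152372,3.5,1493848841",
    "610,160341,2.5,1479545749",
]

# Iterating over a string yields its characters; this is what
# flattening a string amounts to.
chars = list("3.5")

# Convert the rating field to a float first, then count whole ratings.
ratings = [float(line.split(",")[2]) for line in lines]
histogram = sorted(Counter(ratings).items())

print(chars)
print(histogram)
```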

If this solves your problem, please accept the answer.

Answer 2

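What is the logic that flatMap is doing? What does it "flatten" to create such a result?

flatMap takes a function that returns a collection (for example, a list). It is essentially equivalent to running map with that function and then flattening each returned collection into its individual elements. In your flatMap example, the function lambda line : line.split(',')[2] turns each line into its third field, a string, which is then treated as a collection of characters and flattened into single characters.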

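Why does it strip "." separately and keep only the integral part? We are not asking to split with "."

Because the result of flatMap is now an RDD of the individual characters from the third field of every line, countByValue() counts each digit and each decimal point as a separate character, hence the reported result.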
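As a rough check of this explanation, the character counts can be reproduced in plain Python. The flat_map helper below is a hypothetical stand-in for RDD.flatMap (map, then flatten one level), not Spark's implementation, and ratings holds the third field of each of the 23 rows in the question.

```python
from collections import Counter
from itertools import chain

# The rating field of each of the 23 rows in the question, kept as strings.
ratings = [
    "4.0", "4.0", "3.0", "5.0", "5.0",
    "4.0", "3.0", "1.0", "4.0", "3.0", "5.0", "4.0",
    "4.0", "3.5", "3.5", "5.0", "4.5", "4.0", "5.0", "3.5", "3.0", "2.5", "4.5",
]

def flat_map(func, items):
    """Hypothetical stand-in for RDD.flatMap: apply func, then flatten one level."""
    return list(chain.from_iterable(func(item) for item in items))

# The function returns a string, so flattening yields single characters.
chars = flat_map(lambda rating: rating, ratings)
char_counts = sorted(Counter(chars).items())

print(char_counts)
# [('.', 23), ('0', 17), ('1', 1), ('2', 1), ('3', 7), ('4', 9), ('5', 11)]
```

Each rating contributes three characters, so the 23 decimal points in the output are simply one "." per row.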

How should I get back the decimal results with a .5 score?

If you want flatMap to produce the same result as your map version:

m.map(lambda line : line.split(',')[2])

you need the lambda function to return a proper collection containing the selected field, for example:

m.flatMap(lambda line : [line.split(',')[2]])
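The effect of wrapping the field in a list can be sketched in plain Python. As before, flat_map is a hypothetical helper that imitates flatMap (map, then flatten one level), and the two sample lines are taken from the question's data.

```python
from itertools import chain

lines = ["1,1,4.0,964982703", "610,152372,3.5,1493848841"]

def flat_map(func, items):
    """Hypothetical stand-in for RDD.flatMap: apply func, then flatten one level."""
    return list(chain.from_iterable(func(item) for item in items))

# Plain map: one rating string per line.
mapped = [line.split(",")[2] for line in lines]

# Returning the bare string: flattening splits it into characters.
as_chars = flat_map(lambda line: line.split(",")[2], lines)

# Returning a one-element list: flattening unwraps the list and
# leaves the rating string whole.
as_whole = flat_map(lambda line: [line.split(",")[2]], lines)

print(mapped)
print(as_chars)
print(as_whole)
```

With the one-element list, flattening removes only the list wrapper, so the result matches the map version.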

python

mapreduce

flatmap

apache-spark

pyspark