Apache Spark slow on reduceByKey step
I have a 2 MB plain-text file in /usr/local/share/data/. I then run the following code against it in Apache Spark.
from pyspark import SparkConf, SparkContext
from nltk import word_tokenize  # word_pos_tagging and filter_punctuation are user-defined helpers (sketched below)

conf = SparkConf().setMaster("local[*]").setAppName("test").set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)
doc_rdd = sc.textFile("/usr/local/share/data/")
unigrams = doc_rdd.flatMap(word_tokenize)
step1 = unigrams.flatMap(word_pos_tagging)
step2 = step1.filter(lambda x: filter_punctuation(x[0]))
step3 = step2.map(lambda x: (x, 1))
freq_unigrams = step3.reduceByKey(lambda x, y: x + y)
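(word_pos_tagging and filter_punctuation are not shown in the question; the following is only a plausible NLTK-based sketch that matches how they are used above. The names come from the question, the bodies are assumptions.)

import string
from nltk import pos_tag

def word_pos_tagging(token):
    # pos_tag takes a list of tokens and returns a list of (word, tag) pairs,
    # so flatMap over its result yields the (word, tag) tuples reduced below
    return pos_tag([token])

def filter_punctuation(word):
    # keep tokens containing at least one non-punctuation character
    return any(ch not in string.punctuation for ch in word)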
Expected result:
[((u'showing', 'VBG'), 24), ((u'Ave', 'NNP'), 1), ((u'Scrilla364', 'NNP'), 1), ((u'internally', 'RB'), 4), ...]
However, it takes a long time (about 6 minutes) to return the expected word counts. It appears to be stuck on the reduceByKey step. How can I fix this performance issue?
-- Reference --
Hardware spec
Model Name: MacBook Air
Model Identifier: MacBookAir4,2
Processor Name: Intel Core i7
Processor Speed: 1.8 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Memory: 4 GB
Log
15/10/02 16:05:12 INFO HadoopRDD: Input split: file:/usr/local/share/data/enronsent01:0+873602
15/10/02 16:05:12 INFO HadoopRDD: Input split: file:/usr/local/share/data/enronsent01:873602+873602
15/10/02 16:09:11 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:53478 in memory (size: 4.1 KB, free: 530.0 MB)
15/10/02 16:09:11 INFO BlockManagerInfo: Removed broadcast_3_piece0 on localhost:53478 in memory (size: 4.6 KB, free: 530.0 MB)
15/10/02 16:09:11 INFO ContextCleaner: Cleaned accumulator 4
15/10/02 16:09:11 INFO ContextCleaner: Cleaned accumulator 3
15/10/02 16:09:11 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:53478 in memory (size: 3.9 KB, free: 530.0 MB)
15/10/02 16:09:11 INFO ContextCleaner: Cleaned accumulator 2
15/10/02 16:10:05 INFO PythonRDD: Times: total = 292892, boot = 8, init = 275, finish = 292609
15/10/02 16:10:05 INFO Executor: Finished task 1.0 in stage 3.0 (TID 4). 2373 bytes result sent to driver
15/10/02 16:10:05 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 4) in 292956 ms on localhost (1/2)
15/10/02 16:10:35 INFO PythonRDD: Times: total = 322562, boot = 5, init = 276, finish = 322281
15/10/02 16:10:35 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 2373 bytes result sent to driver
15/10/02 16:10:35 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 322591 ms on localhost (2/2)
The code looks fine. There are a couple of options you can try to improve performance.
SparkConf().setMaster("local[*]").setAppName("test").set("spark.executor.memory", "2g")
local -> local[*]
When the job is split into tasks, this lets Spark use as many cores as the machine has available.
If possible, also increase the memory available to the program. A minimal sketch of both changes follows.
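This is not the asker's exact setup, just a sketch assuming a standalone PySpark script. One caveat: in local mode the whole job runs inside the driver JVM, so spark.executor.memory has no effect there; it is the driver's memory that counts.

from pyspark import SparkConf, SparkContext

# "local[*]" starts one worker thread per available core;
# plain "local" would run everything on a single thread.
conf = SparkConf().setMaster("local[*]").setAppName("test")
sc = SparkContext(conf=conf)

# In local mode the executors live inside the driver JVM, whose heap size is
# fixed at startup, so extra memory has to be granted at launch time, e.g.
# (script name is a placeholder):
#   spark-submit --driver-memory 2g your_script.py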
P.S. To really appreciate Spark, you should have a large amount of data, so that running it on a cluster pays off.