Default number of reducers

Question

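In Hadoop, if we have not set the number of reducers, how many reducers will be created?

The number of mappers depends on (total data size)/(input split size). E.g., if the data size is 1 TB and the input split size is 100 MB, then the number of mappers will be (1000*1000)/100 = 10,000.

Which factors does the number of reducers depend on? How many reducers are created for a job?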

Answer 1

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).

With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.

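Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.

The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative tasks and failed tasks.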
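The rule of thumb above can be turned into a quick calculation. This is only a sketch: the class name `ReducerEstimate`, the method `estimate`, and the cluster figures (10 nodes, 8 containers per node) are assumptions made for illustration and are not part of Hadoop.

```java
// Illustrative helper (not a Hadoop API) for the 0.95 / 1.75 heuristic.
final class ReducerEstimate {

    // factor: 0.95 for a single wave of reduces, 1.75 for roughly two waves.
    static int estimate(int nodes, int maxContainersPerNode, double factor) {
        int totalContainers = nodes * maxContainersPerNode;
        // Round down so the result never exceeds the scaled capacity.
        return (int) Math.floor(factor * totalContainers);
    }

    public static void main(String[] args) {
        int nodes = 10;             // assumed cluster size
        int containersPerNode = 8;  // assumed maximum containers per node
        System.out.println(estimate(nodes, containersPerNode, 0.95)); // 76
        System.out.println(estimate(nodes, containersPerNode, 1.75)); // 140
    }
}
```

The resulting number would then be passed to the job explicitly; the framework does not apply this heuristic on its own.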

This article explains the mapper count too.

How many maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

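The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.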

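Thus, if you expect 10TB of input data and have a blocksize of 128MB, you'll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.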
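The 82,000 figure is a rounded value, and the arithmetic behind it can be checked with a small sketch. It assumes binary units (1 TB = 1024 * 1024 MB) and one map per block; the class name `MapCountEstimate` and the method `estimateMaps` are hypothetical and not part of Hadoop.

```java
// Illustrative helper (not a Hadoop API): approximate map count as
// input size divided by block size, rounded up.
final class MapCountEstimate {

    static long estimateMaps(long inputBytes, long blockBytes) {
        // Ceiling division: a trailing partial block still needs its own map.
        return (inputBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long tb = mb * 1024L * 1024L;
        // 10 TB of input with a 128 MB block size.
        System.out.println(estimateMaps(10 * tb, 128 * mb)); // 81920
    }
}
```

In practice the count depends on how the input files are laid out, since each file contributes at least one block, so this is a lower-bound style estimate rather than an exact prediction.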

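If you want to change the default value of 1 for the number of reducers, you can set the property below (from Hadoop 2.x onwards) as a command-line argument: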

mapreduce.job.reduces

Or you can set it programmatically with:

job.setNumReduceTasks(integer_number);

Have a look at one more related SE question: What is Ideal number of reducers on Hadoop?

Answer 2

By default, the number of reducers is set to 1.

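You can change it by adding the parameter mapred.reduce.tasks (the older, now deprecated name for mapreduce.job.reduces) on the command line, in the driver code, or in the conf file that you pass.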

E.g., as a command-line argument:

bin/hadoop jar ... -Dmapred.reduce.tasks=<num reduce tasks>

or, in the driver code:

conf.setNumReduceTasks(int num);

Recommended reading: https://wiki.apache.org/hadoop/HowManyMapsAndReduces
