Why does a count query with a WHERE condition need to shuffle data?

The performance of the query below changes when I change the parameter "spark.sql.shuffle.partitions". Does this query require a shuffle?

Select count(*) from table where id is not null

My other question is what the boundary between the two stages in the figure below represents. Does it indicate a shuffle?

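In the first stage, each task computes a count over its own partition. These partial counts are then transferred to the second stage, which adds them all up and returns the final count.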
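To make that two-stage picture concrete, here is a minimal sketch in plain Python. It is not Spark code and does not reproduce Spark's internals; the partition contents and the function names `partial_count` and `final_count` are made up for illustration, with `None` standing in for NULL.

```python
from typing import Iterable, List, Optional


def partial_count(partition: Iterable[Optional[int]]) -> int:
    """Stage 1: count the non-null ids within a single partition."""
    return sum(1 for row_id in partition if row_id is not None)


def final_count(partial_counts: Iterable[int]) -> int:
    """Stage 2: add up the per-partition counts once they are in one place."""
    return sum(partial_counts)


# Hypothetical data: three partitions of an `id` column (assumed for this sketch).
partitions: List[List[Optional[int]]] = [
    [1, None, 3],
    [None, None],
    [4, 5, 6, 7],
]

# Each "task" produces one small number for its partition.
partials = [partial_count(p) for p in partitions]

# Only these partial counts need to move to the second stage.
total = final_count(partials)

print(partials, total)
```

In this sketch the only values that cross from the first stage to the second are the per-partition counts, not the rows themselves, which is consistent with the description above of the counts being transferred and then summed.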