Spark: What is the difference between Aggregator and UDAF?

In Spark's documentation, Aggregator is:

abstract class Aggregator[-IN, BUF, OUT] extends Serializable

A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value.

UserDefinedAggregateFunction is:

abstract class UserDefinedAggregateFunction extends Serializable

The base class for implementing user-defined aggregate functions (UDAF).

According to Dataset Aggregator - Databricks, "an Aggregator is similar to a UDAF, but the interface is expressed in terms of JVM objects instead of as a Row."

The two classes look very similar. Apart from the types used in the interface, what other differences are there?

A similar question: Performance of UDAF versus Aggregator in Spark

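Apart from the types, one fundamental difference is the external interface:

Aggregator takes a complete Row (it is intended for the "strongly" typed API).
UserDefinedAggregateFunction takes a set of Columns.

This makes Aggregator less flexible, although overall its API is far more user-friendly.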
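To make the interface difference concrete, here is a minimal sketch of the same sum written both ways. The record type `Sale` and the object names `SumAmount` and `SumAmountUdaf` are hypothetical, introduced only for illustration; the usage lines assume an existing SparkSession named `spark`.

```scala
import org.apache.spark.sql.{Encoder, Encoders, Row}
import org.apache.spark.sql.expressions.{Aggregator, MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, DoubleType, StructField, StructType}

// Hypothetical record type used only for this illustration.
case class Sale(shop: String, amount: Double)

// Typed Aggregator: the input type IN is the whole record (Sale), not a column.
object SumAmount extends Aggregator[Sale, Double, Double] {
  def zero: Double = 0.0
  def reduce(buffer: Double, sale: Sale): Double = buffer + sale.amount
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(reduction: Double): Double = reduction
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Untyped UDAF: input and buffer are described by schemas and accessed as Rows.
// (On Spark 3.x this class compiles with a deprecation warning.)
object SumAmountUdaf extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("amount", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
  def evaluate(buffer: Row): Any = buffer.getDouble(0)
}

// Usage sketch, assuming `spark` exists and `import spark.implicits._` is in scope:
//   val ds = Seq(Sale("a", 1.0), Sale("a", 2.0)).toDS()
//   ds.groupByKey(_.shop).agg(SumAmount.toColumn)                            // whole record in
//   ds.groupBy("shop").agg(SumAmountUdaf(org.apache.spark.sql.functions.col("amount")))  // chosen columns in
```

The Aggregator is tied to `Sale`, whereas the UDAF can be applied to any column of a matching type, which is the flexibility trade-off described above.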

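There is also a difference in how state is handled:

Aggregator is stateful. It depends on the mutable internal state of its buffer field.
UserDefinedAggregateFunction is stateless. The state of the buffer is external.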
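As a rough illustration of the buffer handling, the sketch below shows an Aggregator whose buffer is a mutable object that the aggregator itself creates in `zero` and updates in place. The names `Employee`, `AvgBuffer` and `MeanSalary` are hypothetical. In a UserDefinedAggregateFunction, by contrast, the buffer is a `MutableAggregationBuffer` that Spark passes into `initialize`, `update` and `merge`.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical types used only for this illustration.
case class Employee(name: String, salary: Double)
case class AvgBuffer(var sum: Double, var count: Long)

object MeanSalary extends Aggregator[Employee, AvgBuffer, Double] {
  // A fresh mutable buffer is created by the aggregator for each group.
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)

  // The buffer is mutated in place and returned, avoiding a new allocation per row.
  def reduce(buffer: AvgBuffer, employee: Employee): AvgBuffer = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }

  // Partial buffers from different partitions are combined the same way.
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }

  def finish(reduction: AvgBuffer): Double = reduction.sum / reduction.count

  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
```

Because `zero` is a `def`, every call yields a new buffer, so mutating it in `reduce` does not leak state between groups.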

aggregate

apache-spark

apache-spark-sql

spark-dataframe