Difference between org.apache.spark.ml.classification and org.apache.spark.mllib.classification

I am writing a Spark application and want to use algorithms from MLlib. In the API docs I found two different classes for the same algorithm.

For example, there is a LogisticRegression in org.apache.spark.ml.classification and a LogisticRegressionWithSGD in org.apache.spark.mllib.classification.

The only difference I can find is that the one in org.apache.spark.ml inherits from Estimator and can be used in cross-validation. I am confused about why they are placed in different packages. Does anyone know the reason? Thanks!

From the JIRA ticket's design doc:

MLlib now covers a basic selection of machine learning algorithms, e.g., logistic regression, decision trees, alternating least squares, and k-means. The current set of APIs contains several design flaws that prevent us moving forward to address practical machine learning pipelines and make MLlib itself a scalable project.

The new set of APIs will live under org.apache.spark.ml, and o.a.s.mllib will be deprecated once we migrate all features to o.a.s.ml.

The Spark MLlib guide says:

spark.mllib contains the original API built on top of RDDs.

spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.

Using spark.ml is recommended because with DataFrames the API is more versatile and flexible. But we will keep supporting spark.mllib along with the development of spark.ml. Users should be comfortable using spark.mllib features and expect more features coming. Developers should contribute new algorithms to spark.ml if they fit the ML pipeline concept well, e.g., feature extractors and transformers.

I think the documentation explains it well.