Apache Spark vs Spring Cloud Data Flow

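I am new to big data processing and have been reading about tools for stream processing and building data pipelines. I came across Apache Spark and Spring Cloud Data Flow. I would like to know their main differences, along with their pros and cons. Can anyone help me?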

They are two completely different tools.

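Spring Cloud Data Flow (SCDF) is a toolkit for building data integration and real-time data processing pipelines. It helps you orchestrate data pipelines made up of Spring Boot applications (streams or tasks). Under the hood, SCDF may use Spring Batch. Note that these Spring Boot apps can call Spark or Kafka applications to support stream processing.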

Apache Spark is a data processing engine that is widely used for data-intensive processing and data science. It ships with libraries for machine learning (MLlib), graph processing (GraphX), stream processing with Apache Kafka integration (Spark Streaming), and more.

For stream processing, I strongly recommend that you learn Apache Kafka.

As stated at https://dataflow.spring.io/docs/concepts/architecture/#comparison-to-other-architectures:

Comparison to Other Architectures

Spring Cloud Data Flow’s architectural style is different than other Stream and Batch processing platforms. For example, in Apache Spark, Apache Flink, and Google Cloud Dataflow, applications run on a dedicated compute engine cluster. The nature of the compute engine gives these platforms a richer environment for performing complex calculations on the data as compared to Spring Cloud Data Flow, but it introduces the complexity of another execution environment that is often not needed when creating data-centric applications. That does not mean that you cannot do real-time data computations when you use Spring Cloud Data Flow. For example, you can develop applications that use the Kafka Streams API that provide time-sliding-window and moving-average functionality as well as joins of the incoming messages against sets of reference data.
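To make the "time-sliding-window and moving-average" idea from the quote concrete, here is a minimal sketch in plain Python. It does not use Kafka Streams, Spark, or SCDF; the function name `sliding_average`, the `(timestamp, value)` event shape, and the assumption that events arrive in timestamp order are all mine, chosen only for illustration.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

Event = Tuple[float, float]  # (timestamp in seconds, value); hypothetical shape


def sliding_average(events: Iterable[Event], window_seconds: float) -> Iterator[Event]:
    """Yield (timestamp, average) for every incoming event.

    The average covers events whose timestamp lies in the half-open
    interval (ts - window_seconds, ts]. Assumes events are already
    ordered by timestamp; out-of-order input is not handled.
    """
    window: Deque[Event] = deque()
    total = 0.0
    for ts, value in events:
        window.append((ts, value))
        total += value
        # Evict events that have slid out of the time window.
        while window and window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total / len(window)
```

For example, `list(sliding_average([(0, 10.0), (1, 20.0), (2, 30.0), (5, 40.0)], 3))` gives `[(0, 10.0), (1, 15.0), (2, 20.0), (5, 40.0)]`: by the time the event at `t=5` arrives, the three earlier events have left the 3-second window. A stream processing library does this kind of bookkeeping for you, and in a distributed, fault-tolerant way, which is the part that a sketch like this leaves out.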
