Spark Testing: Is It Worth It? (Best Practices)
Using Apache Spark, I am wondering whether testing is really valuable for production and, if so, at which level.
Reading Spark: The Definitive Guide, the authors suggest:
The business logic in your pipelines will likely change as well as the input data. Even more importantly, you want to be sure that what you’re deducing from the raw data is what you actually think that you’re deducing. This means that you’ll need to do robust logical testing with realistic data to ensure that you’re actually getting what you want out of it.
which suggests introducing some kind of testing.
But what struck me is this:
One thing to be wary of here is trying to write a bunch of “Spark Unit Tests” that just test Spark’s functionality. You don’t want to be doing that; instead, you want to be testing your business logic and ensuring that the complex business pipeline that you set up is actually doing what you think it should be doing.
which I read as the book's authors discouraging unit tests (correct me if I have misunderstood).
What may be worth testing instead is the logic of the data transformations applied through Spark.
Again from the book:
First, you might maintain a scratch space, such as an interactive notebook or some equivalent thereof, and then as you build key components and algorithms, you move them to a more permanent location like a library or package. The notebook experience is one that we often recommend (and are using to write this book) because of its simplicity in experimentation
This suggests testing your data transformation logic in an interactive environment such as a notebook (e.g. a Jupyter notebook for PySpark), where you can directly see the results each transformation produces.
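For instance, a quick interactive check could look like the following (a minimal sketch; the orders data and the revenue aggregation are made-up placeholders, not anything from the book):

```python
# In a notebook cell, apply a transformation and inspect its output right away.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical input data standing in for real pipeline input.
orders = spark.createDataFrame(
    [("2024-01-01", 10.0), ("2024-01-01", 5.0), ("2024-01-02", 7.5)],
    ["day", "amount"],
)

# See what the transformation produces immediately.
orders.groupBy("day").agg(F.sum("amount").alias("revenue")).show()
```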
So I would like to ask people more experienced than me: do you agree with the points quoted from the book (or have I misunderstood them)? Can they serve as a kind of best practice in this field, e.g. avoiding unit tests in favor of higher-level tests such as logic/integration tests?
The statement is not saying to avoid unit tests. It is saying to avoid test data that has no business value, otherwise you end up testing the Spark API rather than your business components. For example, if you have written a function as a Spark UDF to perform an aggregation, then when writing the unit test, make sure you feed the function realistic data that mimics your production environment.
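As a minimal sketch of that advice, a unit test can run the business logic on a small, production-shaped dataset against a local SparkSession via pytest. The function total_per_customer, the fixture, and the sample rows below are hypothetical stand-ins, not anything prescribed by the book:

```python
import pytest
from pyspark.sql import SparkSession, functions as F


@pytest.fixture(scope="session")
def spark():
    # A small local session is enough for unit tests; no cluster required.
    session = SparkSession.builder.master("local[2]").appName("tests").getOrCreate()
    yield session
    session.stop()


def total_per_customer(df):
    # The business logic under test: aggregate order amounts per customer.
    return df.groupBy("customer_id").agg(F.sum("amount").alias("total"))


def test_total_per_customer(spark):
    # Realistic, production-shaped rows rather than arbitrary toy values,
    # so the test exercises the business logic instead of the Spark API.
    df = spark.createDataFrame(
        [("c1", 10.0), ("c1", 5.5), ("c2", 3.0)],
        ["customer_id", "amount"],
    )
    result = {r["customer_id"]: r["total"] for r in total_per_customer(df).collect()}
    assert result == {"c1": 15.5, "c2": 3.0}
```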
With a notebook experience like Apache Zeppelin, you can keep all the stages in one place, e.g. data ingestion and visualization, and truly interact with the data pipeline.