如何扩展 apache spark api？

Question

我的任务是弄清楚如何扩展 spark 的 api 以包含一些自定义挂钩，供 iPython Notebook 等另一个程序使用。我已经完成了 quick start guide, the cluster mode overview, the submitting applications doc, and this stack overflow question。我所看到的一切都表明，要在 Spark 中获得运行的内容，您需要使用

spark-submit

让它发生。因此，我编写了一些代码，visa vis spark，从我创建的 accumulo table 中提取了前十行测试数据。但是，我的团队负责人告诉我修改 spark 本身。这是完成我描述的任务的首选方式吗？如果是这样，为什么？价值主张是什么？

Answer 1

没有提供有关您的应用程序需要什么类型的操作的详细信息，因此此处的答案需要保持一般性。

扩展 spark 本身的问题可能归结为：

Can I achieve the needs of the application by leveraging the existing methods within Spark(/SQL/Hive/Streaming)Context and RDD (/SchemaRDD/DStream/..)

其他选择：

Is it possible to embed the required functionality inside the transformation methods of RDD's - either by custom code or by invoking third party libraries.

此处可能的区别因素是现有数据访问和 shuffle/distribution 结构是否支持您的需求。当谈到数据转换时——在大多数情况下，您应该能够将所需的逻辑嵌入到 RDD 的方法中。

所以：

case class InputRecord(..)
case class OutputRecord(..)
def myTranformationLogic(inputRec: InputRecord) : OutputRecord = {
  // put your biz rules/transforms here
  (return) outputRec
}
val myData = sc.textFile(<hdfs path>).map{ l => InputRecord.fromInputLine(l)}
val outputData = myData.map(myTransformationLogic)
outputData.saveAsTextFile(<hdfs path>)

如何扩展 apache spark api？

how to extend apache spark api?

apache-spark