在 Apache Spark 代码的调用方法中使用 memSQL Connection 对象的正确方法是什么

Question

我有一个 spark 代码，其中 Call 方法中的代码调用 memSQL 数据库以从 table 中读取数据。我的代码每次都会打开一个新的连接对象，并在任务完成后将其关闭。此调用是从 Call 方法内部进行的。这工作正常，但 Spark 作业的执行时间变长了。有什么更好的方法可以减少 spark 代码的执行时间。

谢谢。

Answer 1

每个分区可以使用一个连接，如下所示：

rdd.foreachPartition {records =>
  val connection = DB.createConnection()
  //you can use your connection instance inside foreach
  records.foreach { r=>
    val externalData = connection.read(r.externaId)
    //do something with your data
  }
  DB.save(records)
  connection.close()
}

如果您使用 Spark Streaming：

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val connection = DB.createConnection()
    //you can use your connection instance inside foreach
    records.foreach { r=>
      val externalData = connection.read(r.externaId)
      //do something with your data
    }
    DB.save(records)
    connection.close()
  }
}

见http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams

在 Apache Spark 代码的调用方法中使用 memSQL Connection 对象的正确方法是什么

What is the correct way of using memSQL Connection object inside call method of Apache Spark code

singlestore

apache-spark

spark-streaming