How can I construct a String with the contents of a given DataFrame in Scala

Question

Suppose I have a DataFrame. How can I retrieve its contents and represent them as a string?

Here is my attempt, using the sample code below.

val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)

val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
df.foreach(x => {
  println("x = ", x)
  sb.append(x)
})
println("sb = ", sb)

The output of the code shows that the sample DataFrame contains the following:

(x = ,(1.866393526974307,0.064020056478447))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(4.875333799256043,2.8363794106756046E-6))

However, the StringBuilder at the end contains an empty string.

Any ideas on how to retrieve a string for a given DataFrame in Scala?

Many thanks

Answer 1

UPD: As @user8371915 pointed out, the solution below only works in a single JVM (local mode) during development. In fact, we can't modify broadcast variables as if they were globals. You can use accumulators, but that will be quite inefficient. You can also read an answer about reading and writing global vars. Hope this helps.
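For illustration, the accumulator approach mentioned above might look like the following minimal sketch. It assumes Spark 2.x or later (for `collectionAccumulator`); `AccumulatorSketch` and `contentsViaAccumulator` are hypothetical names, and the order of elements in the result is not guaranteed.

```scala
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

// Hypothetical helper, not part of the original answer.
object AccumulatorSketch {
  // Sends the string form of every element back to the driver
  // through a collection accumulator, then concatenates them.
  def contentsViaAccumulator(sc: SparkContext, pairs: Seq[(Double, Double)]): String = {
    val rdd = sc.parallelize(pairs)
    val acc = sc.collectionAccumulator[String]("rows")
    // Runs on the executors; each task's additions are merged on the driver.
    rdd.foreach(x => acc.add(x.toString))
    // acc.value is a java.util.List[String]; element order depends on task scheduling.
    acc.value.asScala.mkString
  }
}
```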

I think you should read up on shared variables in Spark. The Spark programming guide says:

Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.

Let's look at broadcast variables. I edited your code:

val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)

val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
val broadcastVar = sc.broadcast(sb)
df.foreach(x => {
  println("x = ", x)
  broadcastVar.value.append(x)
})
println("sb = ", broadcastVar.value)

Here I use broadcastVar as a container for the StringBuilder variable sb. This is the output:

(x = ,(1.866393526974307,0.064020056478447))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(4.875333799256043,2.8363794106756046E-6))
(x = ,(14.316322626848278,0.0))
(sb = ,(7.876169953355888,7.489564524121306E-13)(1.866393526974307,0.064020056478447)(4.875333799256043,2.8363794106756046E-6)(2.864048126935307,0.004808399479386827)(14.316322626848278,0.0)(4.032486069215076,8.914865448939047E-5))

Hope this helps.
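Given the caveat in the update at the top of this answer, a simpler route that does not rely on shared mutable state is to bring the data to the driver with `collect()` and build the string there. This is a minimal sketch, assuming the data fits in driver memory; `CollectSketch` and `contentsAsString` are hypothetical names.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of the original answer.
object CollectSketch {
  // Brings all elements to the driver and concatenates their string forms.
  // Only suitable when the whole dataset fits in driver memory.
  def contentsAsString[T](rdd: RDD[T], sep: String = ""): String =
    rdd.collect().mkString(sep)
}
```

With the question's code, this could be called as `CollectSketch.contentsAsString(df)`, since `df` there is an RDD of pairs.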

Answer 2

Would the output of df.show(false) help? If so, then this SO answer may help:
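One way to capture what `show` prints, not necessarily the approach in the linked answer, is to redirect console output while it runs. This is a minimal sketch, assuming `df` is a real Spark SQL `DataFrame` (the question's `df` is an RDD and would first need converting, for example with `toDF`); `ShowSketch` and `showToString` are hypothetical names.

```scala
import java.io.ByteArrayOutputStream
import org.apache.spark.sql.DataFrame

// Hypothetical helper, not part of the original answer.
object ShowSketch {
  // Redirects Console output while show() runs and returns what was printed.
  def showToString(df: DataFrame, numRows: Int = 20): String = {
    val out = new ByteArrayOutputStream()
    Console.withOut(out) {
      // truncate = false keeps long cell values intact
      df.show(numRows, false)
    }
    out.toString
  }
}
```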

Answer 3

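Thanks, everyone, for the feedback and for helping me understand this better.

Combining the responses gives the result below. The requirement changed slightly, in that I now represent my df as a list of JSON objects. The code below does this without using a broadcast.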

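import org.apache.spark.sql.{DataFrame, Row}
import scala.util.parsing.json.JSONObject  // assumed source of JSONObject

class HandleDf(df: DataFrame, limit: Int) extends java.io.Serializable {
  // Collect at most `limit` rows to the driver and convert each one to JSON.
  val jsons = df.limit(limit).collect.map(rowToJson(_))

  // Convert a single Row to a JSONObject keyed by column name.
  def rowToJson(r: Row) : JSONObject = {
    try { JSONObject(r.getValuesMap(r.schema.fieldNames)) }
    catch { case t: Throwable =>
        // On failure, return a JSON object describing the error instead.
        JSONObject.apply(Map("Row with error" -> t.toString))
    }
  }
}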

This is how I use the class:

val jsons = new HandleDf(df, 100).jsons
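If a single string is the end goal, a shorter alternative may be Spark's built-in `toJSON`, which serializes each row to a JSON string before collecting. This is a minimal sketch, assuming a Spark 2.x or later `DataFrame`; `JsonStringSketch` and `toJsonString` are hypothetical names, and joining with `[`, `,`, `]` is just one possible layout.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper, not part of the original answer.
object JsonStringSketch {
  // Serializes up to `limit` rows as JSON strings, collects them to the
  // driver, and joins them into one JSON-array-shaped string.
  def toJsonString(df: DataFrame, limit: Int): String =
    df.limit(limit).toJSON.collect().mkString("[", ",", "]")
}
```

For example, `JsonStringSketch.toJsonString(df, 100)` would cover the same 100-row limit used above.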

scala

apache-spark

spark-dataframe