How to join two dataframes?
I can't get a Spark DataFrame join to work (it produces no results). Here is my code:
val e = Seq((1, 2), (1, 3), (2, 4))
var edges = e.map(p => Edge(p._1, p._2)).toDF()
var filtered = edges.filter("start = 1").distinct()
println("filtered")
filtered.show()
filtered.printSchema()
println("edges")
edges.show()
edges.printSchema()
var joined = filtered.join(edges, filtered("end") === edges("start"))//.select(filtered("start"), edges("end"))
println("joined")
joined.show()
case class Edge(start: Int, end: Int) needs to be defined at the top level. This is the output it produces:
filtered
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 1| 3|
+-----+---+
root
|-- start: integer (nullable = false)
|-- end: integer (nullable = false)
edges
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 1| 3|
| 2| 4|
+-----+---+
root
|-- start: integer (nullable = false)
|-- end: integer (nullable = false)
joined
+-----+---+-----+---+
|start|end|start|end|
+-----+---+-----+---+
+-----+---+-----+---+
I don't understand why the output is empty. Why doesn't the first row of filtered join with the last row of edges?
Renaming the columns on one side removes the ambiguity:
val f2 = filtered.withColumnRenamed("start", "fStart").withColumnRenamed("end", "fEnd")
f2.join(edges, f2("fEnd") === edges("start")).show
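As an alternative to renaming, a sketch using DataFrame aliases (assuming the filtered and edges DataFrames from the question, and a Spark session with org.apache.spark.sql.functions available):

```scala
import org.apache.spark.sql.functions.col

// Give each side of the self-join its own alias so Spark can
// distinguish the two lineages of the otherwise identical columns.
val joined = filtered.as("f")
  .join(edges.as("e"), col("f.end") === col("e.start"))
  .select(col("f.start"), col("e.end"))

joined.show()
```

With either approach the join condition references two distinct column instances, so the planner can resolve it instead of collapsing it to a comparison within a single row.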
I think this is because filtered("start").equals(edges("start")) — that is, filtered is a filtered view over edges, and the two share the same column definitions. Since the columns are identical, Spark cannot tell which one you are referring to. That is also why you can do something like
edges.select(filtered("start")).show
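Putting it together, a minimal end-to-end sketch of the renaming fix (the local SparkSession setup and the SelfJoinDemo object name are illustrative assumptions, not from the original post):

```scala
import org.apache.spark.sql.SparkSession

case class Edge(start: Int, end: Int)

object SelfJoinDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("self-join")
      .getOrCreate()
    import spark.implicits._

    val edges = Seq((1, 2), (1, 3), (2, 4)).map(p => Edge(p._1, p._2)).toDF()

    // Rename the filtered side so the join condition references
    // columns that Spark can resolve unambiguously.
    val filtered = edges.filter($"start" === 1).distinct()
      .withColumnRenamed("start", "fStart")
      .withColumnRenamed("end", "fEnd")

    // Row (1, 2) now joins with (2, 4), yielding the two-hop edge 1 -> 4.
    filtered.join(edges, $"fEnd" === $"start")
      .select($"fStart".as("start"), $"end")
      .show()

    spark.stop()
  }
}
```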