Merge two Datasets with different columns in Spark Structured Streaming (Java)

Question

I am trying to find a way to merge two different Datasets into a single combined Dataset that contains all of the columns.

Dataset<Row> dataActual = rowExtracted.selectExpr(
                "split(value,\"[|]\")[3] as sub_date",
                "split(value,\"[|]\")[7] as status",
                "split(value,\"[|]\")[14] as email_add",
                "split(value,\"[|]\")[15] as source_currency",
                "split(value,\"[|]\")[19] as processing_date"
        );


Dataset<Row> dataStatus = dataActual.select("status").map(
                (MapFunction<Row, String>)row-> mapStatus(row.toString()), 
                Encoders.STRING()).selectExpr("value as txn_latest_status").toDF();

I have tried union, join, and so on, but nothing worked:

    Dataset<Row> data = dataActual.union(dataStatus);

Actual:

Dataset 1 :
root
 |-- sub_date: string (nullable = true)
 |-- status: string (nullable = true)
 |-- email_add: string (nullable = true)
 |-- source_currency: string (nullable = true)
 |-- processing_date: string (nullable = true)

Dataset 2 :
root
 |-- txn_latest_status: string (nullable = true)

Expected result: combined Dataset

root
 |-- sub_date: string (nullable = true)
 |-- status: string (nullable = true)
 |-- email_add: string (nullable = true)
 |-- source_currency: string (nullable = true)
 |-- processing_date: string (nullable = true)
 |-- txn_latest_status: string (nullable = true)

Answer 1

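Please see the example below, which joins two Datasets without a join condition. Note that the result is a Cartesian product: every row of the first Dataset is paired with every row of the second.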

scala> res18.show
+-----+
|names|
+-----+
|    A|
|    B|
+-----+


scala> res19.show
+-------+
|numbers|
+-------+
|      1|
|      2|
+-------+

scala> res18.join(res19).show
+-----+-------+
|names|numbers|
+-----+-------+
|    A|      1|
|    A|      2|
|    B|      1|
|    B|      2|
+-----+-------+
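
Because txn_latest_status in the question is derived from the status column of the same row, one alternative to joining is to add it as an extra column with withColumn and a registered UDF. The following is only a minimal sketch under stated assumptions: the class name StatusColumnSketch and the helper withLatestStatus are hypothetical, and the body of mapStatus is a placeholder, since the question does not show the real mapping logic.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.expr;

public class StatusColumnSketch {

    // Placeholder for the question's mapStatus(); the real mapping is not shown there,
    // so trimming and upper-casing is an assumption made purely for illustration.
    public static String mapStatus(String status) {
        return status == null ? null : status.trim().toUpperCase();
    }

    // Hypothetical helper: returns dataActual with one extra column, txn_latest_status,
    // computed row by row from the existing status column.
    public static Dataset<Row> withLatestStatus(SparkSession spark, Dataset<Row> dataActual) {
        // Register the Java function as a SQL UDF returning a string.
        spark.udf().register(
                "mapStatus",
                (UDF1<String, String>) StatusColumnSketch::mapStatus,
                DataTypes.StringType);

        // All original columns are kept; the derived column is appended at the end.
        return dataActual.withColumn("txn_latest_status", expr("mapStatus(status)"));
    }
}
```

Calling withLatestStatus(spark, dataActual) should give a Dataset whose schema has the five original columns followed by txn_latest_status. Since this is a per-row projection rather than a join, it avoids the row multiplication shown above and should apply to a streaming Dataset in the same way as to a batch one.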
