Convert array of string to array of struct in spark java
I have a bunch of JSON data in the following format:
{"name": "Michael", "age": "30", "producta1": "blah1", "producta3": "blah2"}
{"name": "Michael", "age": "30", "producta1": "blah3", "producta3": "blah4"}
{"name": "Michael", "age": "30", "producta1": "blah5", "producta3": "blah6"}
{"name": "Andy", "age": "28", "producta1": "blah5", "producta3": "blah6"}
{"name": "Andy", "age": "28", "producta1": "blah6", "producta3": "blah6"}
{"name": "Andy", "age": "28", "producta1": "blah7", "producta3": "blah6"}
{"name": "Justin", "age": "12", "producta1": "blah5", "producta3": "blah6"}
{"name": "Justin", "age": "12", "producta1": "blah5", "producta3": "blah6"}
My Spark code below performs an aggregation like this:
Dataset<Row> df = sc.read().json("/Users/g.bhageshpur/Downloads/spark-master/examples/src/main/examples/src/main/resources/people.json");
df.createOrReplaceTempView("people");
Dataset<Row> sqlDf = sc.sql("SELECT * FROM people");
Dataset<Row> groupby = sqlDf.groupBy(new Column("name"), new Column("age"))
        .agg(org.apache.spark.sql.functions.collect_list("producta1"),
             org.apache.spark.sql.functions.collect_list("producta3"))
        .toDF("name", "age", "producta1", "producta2");
The above code gives output similar to:
+-------+---+--------------------+--------------------+
| name|age| producta1| producta2|
+-------+---+--------------------+--------------------+
| Andy| 28|[blah5, blah6, bl...|[blah6, blah6, bl...|
| Justin| 12| [blah5, blah6]| [blah6, blah6]|
|Michael| 30|[blah1, blah3, bl...|[blah2, blah4, bl...|
+-------+---+--------------------+--------------------+
I have a requirement to convert the array values in the producta1 column above into an array of JSON objects, similar to:
[{"producta1": "blah5"},{"producta1": "blah6"},{"producta1": "blah7"}]
[{"producta1": "blah1"},{"producta1": "blah3"},{"producta1": "blah5"}]
I tried something like:
groupby.withColumn("newcolumn", functions.to_json(struct("producta1")));
The above snippet does not give me the desired result. How can I produce an array of JSON objects in Spark with Java?
Try the code below (shown in Scala; the same `to_json`, `struct`, and `collect_list` functions are available in the Java API via `org.apache.spark.sql.functions`).
df
.groupBy($"name",$"age")
.agg(
collect_list(to_json(struct("producta1"))).as("producta1"), // use to_json & struct functions here.
collect_list(to_json(struct($"producta3"))).as("producta3") // use to_json & struct functions here.
).show(false)
+-------+---+---------------------------------------------------------------------+---------------------------------------------------------------------+
|name |age|producta1 |producta3 |
+-------+---+---------------------------------------------------------------------+---------------------------------------------------------------------+
|Andy |28 |[{"producta1":"blah5"}, {"producta1":"blah6"}, {"producta1":"blah7"}]|[{"producta3":"blah6"}, {"producta3":"blah6"}, {"producta3":"blah6"}]|
|Justin |12 |[{"producta1":"blah5"}, {"producta1":"blah5"}] |[{"producta3":"blah6"}, {"producta3":"blah6"}] |
|Michael|30 |[{"producta1":"blah1"}, {"producta1":"blah3"}, {"producta1":"blah5"}]|[{"producta3":"blah2"}, {"producta3":"blah4"}, {"producta3":"blah6"}]|
+-------+---+---------------------------------------------------------------------+---------------------------------------------------------------------+
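Since the question is about Java, here is a rough Java equivalent of the Scala answer above, as a minimal sketch. The input path and the local-mode session setup are assumptions for illustration; the key idea is to wrap each value in a single-field struct, serialize it with `to_json`, and then `collect_list` the resulting JSON strings per group.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.collect_list;
import static org.apache.spark.sql.functions.struct;
import static org.apache.spark.sql.functions.to_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonArrayAgg {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("json-array-agg")
                .master("local[*]") // local mode for illustration only
                .getOrCreate();

        // Assumed path; replace with the actual location of people.json.
        Dataset<Row> df = spark.read().json("people.json");

        // struct(...) wraps each value in a one-field row, to_json(...)
        // serializes that row to a {"producta1":"..."} string, and
        // collect_list(...) gathers the strings into an array per group.
        Dataset<Row> result = df
                .groupBy(col("name"), col("age"))
                .agg(
                        collect_list(to_json(struct(col("producta1")))).as("producta1"),
                        collect_list(to_json(struct(col("producta3")))).as("producta3"));

        result.show(false);
        spark.stop();
    }
}
```

The important difference from the attempt in the question is the order of operations: `to_json(struct(...))` must be applied per input row, inside the `collect_list` aggregation, rather than to the already-collected array column afterwards.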