Spark Scala: Access data inside a struct that is inside an array
The schema looks like this:
root
 |-- orderitemlist: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- internal-material-code: string (nullable = true)
 |    |    |-- lot-number: string (nullable = true)
 |    |    |-- packaging-item-code: string (nullable = true)
 |    |    |-- packaging-item-code-type: string (nullable = true)
How can I access the values of internal-material-code and lot-number?
This is what I tried when building the DataFrame:
df.withColumn("internalmaterialcode", col("orderitemlist")(0).getItem("internal-material-code"))
and also:
df.withColumn("internalmaterialcode", col("orderitemlist")(0)("internal-material-code"))
and this:
df.withColumn("orderitemlistarray", explode(col("orderitemlist")))
.withColumn("internalmaterialcode", col("orderitemlistarray").getItem("internal-material-code"))
and this:
df.withColumn("orderitemlistarray", explode(col("orderitemlist")))
.withColumn("internalmaterialcode", col("orderitemlistarray.internal-material-code"))
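But it gives null.
I have seen similar patterns in other Stack Overflow questions, but none of the answers worked for me. Can someone answer this or point me in the right direction?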
After the explode, select the newly created column with ".*"; that gives you all the data from the struct fields.
Example:
val va = """{
  "orderitemlist": [{
    "internal-material-code": "123",
    "lot-number": "vv",
    "packaging-item-code": "pp",
    "packaging-item-code-type": "ll"
  },{
    "internal-material-code": "234",
    "lot-number": "vv",
    "packaging-item-code": "pp",
    "packaging-item-code-type": "ll"
  }]
}"""
// toDS needs spark.implicits._ (already imported in spark-shell)
val df = spark.read.json(Seq(va).toDS).toDF
// explode gives one row per array element; "arr.*" expands the struct into columns
df.withColumn("arr", explode(col("orderitemlist"))).select("arr.*").show()
Result:
+----------------------+----------+-------------------+------------------------+
|internal-material-code|lot-number|packaging-item-code|packaging-item-code-type|
+----------------------+----------+-------------------+------------------------+
|                   123|        vv|                 pp|                      ll|
|                   234|        vv|                 pp|                      ll|
+----------------------+----------+-------------------+------------------------+
Now you get all the columns from the struct inside the array.
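If you only need internal-material-code and lot-number rather than every struct field, something along the following lines should work. This is only a sketch: the object name OrderItemFields, the helpers sampleDf and itemFields, and the output column names are placeholders of mine, and it assumes Spark 2.2 or later so that spark.read.json accepts a Dataset[String].

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper object; the names are illustrative only.
object OrderItemFields {

  // Builds the same sample data as in the example above.
  def sampleDf(spark: SparkSession): DataFrame = {
    import spark.implicits._ // provides toDS on Seq[String]
    val json = """{
      "orderitemlist": [{
        "internal-material-code": "123",
        "lot-number": "vv",
        "packaging-item-code": "pp",
        "packaging-item-code-type": "ll"
      },{
        "internal-material-code": "234",
        "lot-number": "vv",
        "packaging-item-code": "pp",
        "packaging-item-code-type": "ll"
      }]
    }"""
    spark.read.json(Seq(json).toDS)
  }

  // One output row per array element, keeping only the two requested fields.
  def itemFields(df: DataFrame): DataFrame =
    df.withColumn("item", explode(col("orderitemlist")))
      .select(
        col("item").getField("internal-material-code").as("internalmaterialcode"),
        col("item").getField("lot-number").as("lotnumber"))

  def main(args: Array[String]): Unit = {
    // Local session for trying the sketch outside spark-shell.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("order-item-fields")
      .getOrCreate()
    itemFields(sampleDf(spark)).show()
    spark.stop()
  }
}
```

I used getField here instead of a SQL expression string because the hyphens in the field names would need backtick quoting inside expr or selectExpr.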
I checked the code blocks you shared and they work fine. Here is what I ran (as an extension of the earlier solution):
>>>df.withColumn("ves", $"orderitemlist.lot-number").show
+--------------------+--------+
|       orderitemlist|     ves|
+--------------------+--------+
|[[123, vv, pp, ll...|[vv, vv]|
+--------------------+--------+
>>>df.withColumn("vew", $"orderitemlist".getItem("lot-number")).show
+--------------------+--------+
|       orderitemlist|     vew|
+--------------------+--------+
|[[123, vv, pp, ll...|[vv, vv]|
+--------------------+--------+
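Both calls above return an array holding that field for every element. If you want the value from just the first element, as in the question's first two attempts, a rough sketch could look like this. The object name FirstOrderItem, the helper firstItemFields, the output column names, and the shortened sample JSON are my own assumptions, not something from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object; the names are illustrative only.
object FirstOrderItem {

  // Two equivalent orderings: index the array and then read the field,
  // or project the field across the array and then index the result.
  def firstItemFields(df: DataFrame): DataFrame =
    df.select(
      col("orderitemlist").getItem(0).getField("internal-material-code").as("internalmaterialcode"),
      col("orderitemlist.lot-number").getItem(0).as("lotnumber"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("first-order-item")
      .getOrCreate()
    import spark.implicits._ // provides toDS on Seq[String]

    // Shortened sample with only the two fields of interest.
    val json =
      """{"orderitemlist":[{"internal-material-code":"123","lot-number":"vv"},{"internal-material-code":"234","lot-number":"ww"}]}"""
    val df = spark.read.json(Seq(json).toDS)

    firstItemFields(df).show()
    spark.stop()
  }
}
```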