带有列表列表的数据框如何将每一行分解为列-pyspark
How can dataframe with list of lists can be explode each line as columns - pyspark
我有如下数据框
--------------------+
| pas1|
+--------------------+
|[[[[H, 5, 16, 201...|
|[, 1956-09-22, AD...|
|[, 1961-03-19, AD...|
|[, 1962-02-09, AD...|
+--------------------+
想要从以上 4 行的每一行中提取几列并创建如下所示的数据框。列名应该来自架构而不是像 column1 和 column2 这样的硬编码。
---------|-----------+
| gender | givenName |
+--------|-----------+
| a | b |
| a | b |
| a | b |
| a | b |
+--------------------+
pas1 - schema
root
|-- pas1: struct (nullable = true)
| |-- contactList: struct (nullable = true)
| | |-- contact: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- contactTypeCode: string (nullable = true)
| | | | |-- contactMediumTypeCode: string (nullable = true)
| | | | |-- contactTypeID: string (nullable = true)
| | | | |-- lastUpdateTimestamp: string (nullable = true)
| | | | |-- contactInformation: string (nullable = true)
| |-- dateOfBirth: string (nullable = true)
| |-- farePassengerTypeCode: string (nullable = true)
| |-- gender: string (nullable = true)
| |-- givenName: string (nullable = true)
| |-- groupDepositIndicator: string (nullable = true)
| |-- infantIndicator: string (nullable = true)
| |-- lastUpdateTimestamp: string (nullable = true)
| |-- passengerFOPList: struct (nullable = true)
| | |-- passengerFOP: struct (nullable = true)
| | | |-- fopID: string (nullable = true)
| | | |-- lastUpdateTimestamp: string (nullable = true)
| | | |-- fopFreeText: string (nullable = true)
| | | |-- fopSupplementaryInfoList: struct (nullable = true)
| | | | |-- fopSupplementaryInfo: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- type: string (nullable = true)
| | | | | | |-- value: string (nullable = true)
感谢帮助
如果你想从包含结构的数据框中提取几列,你可以简单地做这样的事情:
from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.sparkContext.parallelize([Row(pas1=Row(gender='a', givenName='b'))]).toDF()
df.select('pas1.gender','pas1.givenName').show()
相反,如果您想展平数据框,这个问题应该对您有所帮助:
我有如下数据框
--------------------+
| pas1|
+--------------------+
|[[[[H, 5, 16, 201...|
|[, 1956-09-22, AD...|
|[, 1961-03-19, AD...|
|[, 1962-02-09, AD...|
+--------------------+
想要从以上 4 行的每一行中提取几列并创建如下所示的数据框。列名应该来自架构而不是像 column1 和 column2 这样的硬编码。
---------|-----------+
| gender | givenName |
+--------|-----------+
| a | b |
| a | b |
| a | b |
| a | b |
+--------------------+
pas1 - schema
root
|-- pas1: struct (nullable = true)
| |-- contactList: struct (nullable = true)
| | |-- contact: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- contactTypeCode: string (nullable = true)
| | | | |-- contactMediumTypeCode: string (nullable = true)
| | | | |-- contactTypeID: string (nullable = true)
| | | | |-- lastUpdateTimestamp: string (nullable = true)
| | | | |-- contactInformation: string (nullable = true)
| |-- dateOfBirth: string (nullable = true)
| |-- farePassengerTypeCode: string (nullable = true)
| |-- gender: string (nullable = true)
| |-- givenName: string (nullable = true)
| |-- groupDepositIndicator: string (nullable = true)
| |-- infantIndicator: string (nullable = true)
| |-- lastUpdateTimestamp: string (nullable = true)
| |-- passengerFOPList: struct (nullable = true)
| | |-- passengerFOP: struct (nullable = true)
| | | |-- fopID: string (nullable = true)
| | | |-- lastUpdateTimestamp: string (nullable = true)
| | | |-- fopFreeText: string (nullable = true)
| | | |-- fopSupplementaryInfoList: struct (nullable = true)
| | | | |-- fopSupplementaryInfo: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- type: string (nullable = true)
| | | | | | |-- value: string (nullable = true)
感谢帮助
如果你想从包含结构的数据框中提取几列,你可以简单地做这样的事情:
from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.sparkContext.parallelize([Row(pas1=Row(gender='a', givenName='b'))]).toDF()
df.select('pas1.gender','pas1.givenName').show()
相反,如果您想展平数据框,这个问题应该对您有所帮助: