How to get a list column with values of multiple columns given in another column in Pyspark Dataframe?
Is there a way in PySpark to create a new column like the one shown in the DataFrame below?
I have been trying a list comprehension:
import pyspark.sql.functions as F
df.withColumn('result', [F.col(colname) for colname in F.col('colList')])
But it does not work.
The expected result is:
+----+----+----+----+----+---------------+------+
|col1|col2|col3|col4|col5| colList|result|
+----+----+----+----+----+---------------+------+
| 1| 2| 0| 3| 4|['col1','col2']| [1,2]|
| 1| 2| 0| 3| 4|['col2','col3']| [2,0]|
| 1| 2| 0| 3| 4|['col1','col3']| [1,0]|
| 1| 2| 0| 3| 4|['col3','col4']| [0,3]|
| 1| 2| 0| 3| 4|['col2','col5']| [2,4]|
| 1| 2| 0| 3| 4|['col4','col5']| [3,4]|
+----+----+----+----+----+---------------+------+
# Loading requisite functions and creating the DataFrame
from pyspark.sql.functions import create_map, lit, col, struct
from itertools import chain
myValues = [(1,2,0,3,4,['col1','col2']),(1,2,0,3,4,['col2','col3']),
            (1,2,0,3,4,['col1','col3']),(1,2,0,3,4,['col3','col4']),
            (1,2,0,3,4,['col2','col5']),(1,2,0,3,4,['col4','col5'])]
df = sqlContext.createDataFrame(myValues,['col1','col2','col3','col4','col5','colList'])
df.show()
+----+----+----+----+----+------------+
|col1|col2|col3|col4|col5| colList|
+----+----+----+----+----+------------+
| 1| 2| 0| 3| 4|[col1, col2]|
| 1| 2| 0| 3| 4|[col2, col3]|
| 1| 2| 0| 3| 4|[col1, col3]|
| 1| 2| 0| 3| 4|[col3, col4]|
| 1| 2| 0| 3| 4|[col2, col5]|
| 1| 2| 0| 3| 4|[col4, col5]|
+----+----+----+----+----+------------+
As the next step, we create separate columns holding the individual column names stored in the array colList.
df = df.withColumn('first_col',col('colList')[0])
df = df.withColumn('second_col',col('colList')[1])
df.show()
+----+----+----+----+----+------------+---------+----------+
|col1|col2|col3|col4|col5| colList|first_col|second_col|
+----+----+----+----+----+------------+---------+----------+
| 1| 2| 0| 3| 4|[col1, col2]| col1| col2|
| 1| 2| 0| 3| 4|[col2, col3]| col2| col3|
| 1| 2| 0| 3| 4|[col1, col3]| col1| col3|
| 1| 2| 0| 3| 4|[col3, col4]| col3| col4|
| 1| 2| 0| 3| 4|[col2, col5]| col2| col5|
| 1| 2| 0| 3| 4|[col4, col5]| col4| col5|
+----+----+----+----+----+------------+---------+----------+
The list of columns holding the integer values -
concerned_columns = [x for x in df.columns if x not in {'colList','first_col','second_col'}]
print(concerned_columns)
['col1', 'col2', 'col3', 'col4', 'col5']
Now comes the most important part: using the create_map function, available in Spark 2.0+, we create a mapping between each column name and its respective value.
# Mapping - (column name, column value)
col_name_value_mapping = create_map(*chain.from_iterable(
    (lit(c), col(c)) for c in concerned_columns
))
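To see what this mapping evaluates to per row, it can be materialized as a column and inspected; this is just an optional sanity check, and the column name name_value_map is used here purely for illustration.
# Optional sanity check: each row's map pairs every column name with that row's value,
# e.g. roughly {col1 -> 1, col2 -> 2, col3 -> 0, col4 -> 3, col5 -> 4}.
df.withColumn('name_value_map', col_name_value_mapping) \
  .select('name_value_map').show(1, truncate=False)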
Finally, we apply this mapping to fetch the values of the columns whose names are stored in first_col and second_col, and use struct to put them together into the result column.
df = df.withColumn('result', struct(col_name_value_mapping[col('first_col')],
                                    col_name_value_mapping[col('second_col')]))
df = df.drop('first_col','second_col')
df.show()
+----+----+----+----+----+------------+------+
|col1|col2|col3|col4|col5| colList|result|
+----+----+----+----+----+------------+------+
| 1| 2| 0| 3| 4|[col1, col2]| [1,2]|
| 1| 2| 0| 3| 4|[col2, col3]| [2,0]|
| 1| 2| 0| 3| 4|[col1, col3]| [1,0]|
| 1| 2| 0| 3| 4|[col3, col4]| [0,3]|
| 1| 2| 0| 3| 4|[col2, col5]| [2,4]|
| 1| 2| 0| 3| 4|[col4, col5]| [3,4]|
+----+----+----+----+----+------------+------+
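As a side note, if colList may contain more than two column names, the same map can be combined with the transform higher-order function (usable through expr, assuming Spark 2.4+) to look up every entry at once and return an array instead of a struct. The following is only a sketch along those lines, not part of the original answer; result_all and name_value_map are illustrative names.
from pyspark.sql.functions import expr

# Sketch: look up every name listed in colList against the name -> value map.
df_general = (df.withColumn('name_value_map', col_name_value_mapping)
                .withColumn('result_all', expr("transform(colList, c -> name_value_map[c])"))
                .drop('name_value_map'))
df_general.show()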