PySpark cross join

Question

Say my DataFrame has the following values:

col1
|1|
|2|
|3|

I want to create a new PySpark DataFrame with the following values:

|1x1|
|1x2|
|1x3|
|2x1|
|2x2|
|2x3|
|3x1|
|3x2|
|3x3|

Can anyone tell me how to build this DataFrame? I'm currently using crossJoin, but it raises an error.

Answer 1

Try this:

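from pyspark.sql import SparkSession, functions as F

# Reuse the active session if there is one (e.g. in a notebook or the pyspark shell)
spark = SparkSession.builder.getOrCreate()

df = spark.range(10)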
df.show()
# +---+
# | id|
# +---+
# |  0|
# |  1|
# |  2|
# |  3|
# |  4|
# |  5|
# |  6|
# |  7|
# |  8|
# |  9|
# +---+

# Alias each side so the two identically named "id" columns can be told apart
df_1 = df.alias("df1")
df_2 = df.alias("df2")
df_cross = df_1.crossJoin(df_2)

df_cross.show()
# +---+---+
# | id| id|
# +---+---+
# |  0|  0|
# |  0|  1|
# |  0|  2|
# |  0|  3|
# |  0|  4|
# |  1|  0|
# |  1|  1|
# |  1|  2|
# |  1|  3|
# |  1|  4|
# |  2|  0|
# |  2|  1|
# |  2|  2|
# |  2|  3|
# |  2|  4|
# |  3|  0|
# |  3|  1|
# |  3|  2|
# |  3|  3|
# |  3|  4|
# +---+---+
# only showing top 20 rows


# Join the two ids into a single string column, separated by "x"
df_cross = df_cross.withColumn(
    "concat", F.concat_ws("x", F.col("df1.id"), F.col("df2.id"))
)

df_cross.show()
# +---+---+------+
# | id| id|concat|
# +---+---+------+
# |  0|  0|   0x0|
# |  0|  1|   0x1|
# |  0|  2|   0x2|
# |  0|  3|   0x3|
# |  0|  4|   0x4|
# |  1|  0|   1x0|
# |  1|  1|   1x1|
# |  1|  2|   1x2|
# |  1|  3|   1x3|
# |  1|  4|   1x4|
# |  2|  0|   2x0|
# |  2|  1|   2x1|
# |  2|  2|   2x2|
# |  2|  3|   2x3|
# |  2|  4|   2x4|
# |  3|  0|   3x0|
# |  3|  1|   3x1|
# |  3|  2|   3x2|
# |  3|  3|   3x3|
# |  3|  4|   3x4|
# +---+---+------+
# only showing top 20 rows
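
For the three values in the question, the expected result can be checked without Spark. Here is a minimal sketch in plain Python, assuming the values are 1, 2, 3 and the separator is "x" as in the question. `cross_labels` is a hypothetical helper name used for illustration, not part of PySpark.

```python
from itertools import product


def cross_labels(values, sep="x"):
    """Return 'a<sep>b' for every ordered pair (a, b) drawn from values."""
    # product(values, repeat=2) yields the Cartesian product of values with itself,
    # which is the same set of pairs a self cross join produces.
    return [f"{a}{sep}{b}" for a, b in product(values, repeat=2)]


# Assumed input: the col1 values from the question
labels = cross_labels([1, 2, 3])
print(labels)
```

The resulting list can be compared with the `concat` column collected from the cross-joined DataFrame. Spark does not guarantee row order without an explicit sort, so compare the two as sets or after sorting.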
