PySpark: labeling unique arrays regardless of order in array column

Question

I have an array column in a PySpark DataFrame, and I want to label the arrays that contain the same elements, regardless of order. Expected output:

| Array Column | Label  |
| ------------ | ------ |
| [1,2,3]      | Group1 |
| [3,2,1]      | Group1 |
| [2,1,5]      | Group2 |
| [1,2,5]      | Group2 |
| [2,3,1]      | Group1 |

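Does anyone have any ideas? Checking whether array_except returns an empty array requires comparing rows pairwise, and the dataset is large, so I would like to know whether PySpark offers another solution.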

Answer 1

Data

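from pyspark.sql import SparkSession

# Reuse the active session, or create one when running outside a shell/notebook
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, [1, 2, 3]),
     (2, [3, 2, 1]),
     (3, [2, 1, 5]),
     (4, [1, 2, 5]),
     (5, [2, 3, 1])],
    ('id', 'array_column'))

df.show()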

Solution

from pyspark.sql import Window
from pyspark.sql.functions import concat, dense_rank, expr, lit

# Unpartitioned window: every row is ranked together, so Spark moves all data to a single partition
w = Window.partitionBy().orderBy('label')

(df
 # Sort each array and join its elements into a string, so arrays with the same elements share one key
 .withColumn('label', expr("array_join(array_sort(array_column), ',')"))
 # Dense-rank the keys and prefix the rank with 'Group'
 .withColumn('label', concat(lit('Group'), dense_rank().over(w).cast('string')))
 .orderBy('id')  # restore the original row order
 .show())

+---+------------+------+
| id|array_column| label|
+---+------------+------+
|  1|   [1, 2, 3]|Group1|
|  2|   [3, 2, 1]|Group1|
|  3|   [2, 1, 5]|Group2|
|  4|   [1, 2, 5]|Group2|
|  5|   [2, 3, 1]|Group1|
+---+------------+------+
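
To make the steps explicit, here is a minimal sketch of the same logic in plain Python, without Spark. The function name label_arrays is hypothetical and only for illustration. It assumes the arrays hold integers and that the label numbering follows the sorted order of the joined keys, as dense_rank does in the solution above.

```python
from typing import Dict, List


def label_arrays(arrays: List[List[int]]) -> List[str]:
    """Give arrays with the same elements the same 'GroupN' label."""
    # Step 1: build an order-independent key per array
    # (mirrors array_sort followed by array_join with ',').
    keys = [",".join(str(x) for x in sorted(arr)) for arr in arrays]

    # Step 2: dense-rank the distinct keys in sorted order
    # (mirrors dense_rank over a window ordered by the key).
    rank_of: Dict[str, int] = {
        key: rank for rank, key in enumerate(sorted(set(keys)), start=1)
    }

    # Step 3: prefix each rank with 'Group'
    # (mirrors concat(lit('Group'), rank cast to string)).
    return [f"Group{rank_of[key]}" for key in keys]


labels = label_arrays([[1, 2, 3], [3, 2, 1], [2, 1, 5], [1, 2, 5], [2, 3, 1]])
print(labels)
```

Each array is reduced to its key once, so no array is ever compared against another array directly. That is what avoids the pairwise array_except check mentioned in the question.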

