Generate all possible combinations between two columns and an indicator to show if that combination exists in the source table

Question

I am completely lost at one particular stage of a transformation.

I plan to implement it using SQL or pyspark.

My input format is:

id  name
1   A
1   C
1   E
2   A
2   B
2   C
2   E
2   F
3   A
3   E
3   D

Can you help me get this output format?

id name rating
1  A    1
1  B    0
1  C    1
1  D    0
1  E    1
1  F    0
2  A    1
2  B    1
2  C    1
2  D    0
2  E    1
2  F    1
3  A    1
3  B    0
3  C    0
3  D    1
3  E    1
3  F    0

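Because the SQL query runs forever, I want to see whether I can achieve the same result with pyspark, so that I can feed the dataset into ALS.

In other words, how do I generate all possible combinations between id and name, setting the rating to 1 if the combination exists in the table and to 0 otherwise?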

Answer 1

In other words, generate all possible combinations between id and name; if the combination exists in the table the rating is 1, otherwise 0?

You need to combine two derived tables with a CROSS JOIN to get all possible id and name combinations.

Query

SELECT 
 *
FROM ( 

 SELECT 
   *
  FROM (
    SELECT
      DISTINCT
       id
    FROM
      Table1    
  ) AS distinct_id
  CROSS JOIN (
    SELECT 
      DISTINCT 
        name
    FROM 
    Table1 
  ) AS distinct_name
) AS table_combination

 ORDER BY 
    id ASC
  , name ASC

Result

| id | name |
|----|------|
|  1 |    A |
|  1 |    B |
|  1 |    C |
|  1 |    D |
|  1 |    E |
|  1 |    F |
|  2 |    A |
|  2 |    B |
|  2 |    C |
|  2 |    D |
|  2 |    E |
|  2 |    F |
|  3 |    A |
|  3 |    B |
|  3 |    C |
|  3 |    D |
|  3 |    E |
|  3 |    F |

See demo http://sqlfiddle.com/#!9/ba5f17/17
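To see what the two DISTINCT subqueries and the CROSS JOIN compute, here is a minimal plain-Python sketch of the same idea. It is only an illustration on the sample rows from the question, not a replacement for the SQL; the variable names are made up for this example.

```python
from itertools import product

# Sample rows copied from the question: (id, name) pairs.
rows = [
    (1, "A"), (1, "C"), (1, "E"),
    (2, "A"), (2, "B"), (2, "C"), (2, "E"), (2, "F"),
    (3, "A"), (3, "E"), (3, "D"),
]

# Counterpart of SELECT DISTINCT id and SELECT DISTINCT name.
distinct_ids = sorted({row_id for row_id, _ in rows})
distinct_names = sorted({name for _, name in rows})

# Counterpart of the CROSS JOIN: every id paired with every name.
combinations = list(product(distinct_ids, distinct_names))

print(len(combinations))  # 3 ids x 6 names = 18
print(combinations[:6])   # the six rows for id 1
```

The size of the result is always the number of distinct ids times the number of distinct names, which is worth keeping in mind before running this on a large table.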

Now we can use a LEFT JOIN combined with CASE WHEN column IS NULL ... END to check whether a combination exists in the current table or was only generated by the CROSS JOIN.

Query

SELECT
   Table_combination.id
 , Table_combination.name
 , (
     CASE 
      WHEN Table1.id IS NULL
      THEN 0
      ELSE 1
     END
   ) AS rating
FROM ( 

  SELECT 
   *
  FROM (
    SELECT
      DISTINCT
       id
    FROM
      Table1    
  ) AS distinct_id
  CROSS JOIN (
    SELECT 
      DISTINCT 
        name
    FROM 
    Table1 
  ) AS distinct_name
) AS Table_combination

LEFT JOIN 
 Table1
ON
   Table_combination.id = Table1.id
 AND
   Table_combination.name = Table1.name

ORDER BY 
   Table_combination.id ASC
 , Table_combination.name ASC

Result

| id | name | rating |
|----|------|--------|
|  1 |    A |      1 |
|  1 |    B |      0 |
|  1 |    C |      1 |
|  1 |    D |      0 |
|  1 |    E |      1 |
|  1 |    F |      0 |
|  2 |    A |      1 |
|  2 |    B |      1 |
|  2 |    C |      1 |
|  2 |    D |      0 |
|  2 |    E |      1 |
|  2 |    F |      1 |
|  3 |    A |      1 |
|  3 |    B |      0 |
|  3 |    C |      0 |
|  3 |    D |      1 |
|  3 |    E |      1 |
|  3 |    F |      0 |

See demo http://sqlfiddle.com/#!9/ba5f17/13
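If the fiddle link is unavailable, the same query shape can be tried locally. The sketch below runs it against an in-memory SQLite database through Python's standard sqlite3 module; this is an assumption made for convenience (the demo above uses a different engine), and the short aliases i, n, c and t are introduced only for this example.

```python
import sqlite3

# Sample rows copied from the question: (id, name) pairs.
rows = [
    (1, "A"), (1, "C"), (1, "E"),
    (2, "A"), (2, "B"), (2, "C"), (2, "E"), (2, "F"),
    (3, "A"), (3, "E"), (3, "D"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO Table1 VALUES (?, ?)", rows)

query = """
SELECT c.id, c.name,
       CASE WHEN t.id IS NULL THEN 0 ELSE 1 END AS rating
FROM (
    SELECT i.id, n.name
    FROM (SELECT DISTINCT id FROM Table1) AS i
    CROSS JOIN (SELECT DISTINCT name FROM Table1) AS n
) AS c
LEFT JOIN Table1 AS t
  ON c.id = t.id AND c.name = t.name
ORDER BY c.id, c.name
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Rows whose pair exists in Table1 find a match in the LEFT JOIN and get rating 1; the generated pairs without a match have NULL on the Table1 side and get 0.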

Answer 2

I made a function based on the answer by Raymond Nijlands:

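from pyspark.sql import SparkSession


def expand_grid(df, df_name, col_a, col_b, col_c):
    """Return every (col_a, col_b) combination; col_c is 0 where the pair is missing from df."""
    # Reuse the active Spark session instead of relying on a global `spark` variable.
    spark = SparkSession.builder.getOrCreate()
    # Register the DataFrame so it can be referenced by name in SQL.
    df.createOrReplaceTempView(df_name)
    expand_sql = f"""
        SELECT
            expanded.{col_a},
            expanded.{col_b},
            CASE
                WHEN {df_name}.{col_c} IS NULL THEN 0
                ELSE {df_name}.{col_c}
            END AS {col_c}
        FROM (
            SELECT *
            FROM (
                SELECT DISTINCT {col_a}
                FROM {df_name}
            ) AS {col_a}s
            CROSS JOIN (
                SELECT DISTINCT {col_b}
                FROM {df_name}
            ) AS {col_b}s
        ) AS expanded
        LEFT JOIN {df_name}
        ON expanded.{col_a} = {df_name}.{col_a}
        AND expanded.{col_b} = {df_name}.{col_b}
    """
    print(expand_sql)  # Show the generated SQL for debugging.
    return spark.sql(expand_sql)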

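Usage in the context of this question (df must already contain the rating column, for example set to 1 on every existing row):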

expand_grid(df=df, df_name="df_name", col_a="id", col_b="name", col_c="rating")
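Unlike the CASE in Answer 1, this function keeps whatever value col_c already holds and only fills the missing pairs with 0. The sketch below shows that behavior without a Spark session by running the same query shape on SQLite. The function expand_grid_sqlite, the ratings table and its sample values are all hypothetical and exist only for this illustration; COALESCE is used as a shorter equivalent of the CASE expression.

```python
import sqlite3


def expand_grid_sqlite(conn, table, col_a, col_b, col_c):
    """Hypothetical SQLite analogue of expand_grid, for local experiments only.

    Identifiers are interpolated into the SQL text, so pass trusted names only.
    """
    sql = f"""
        SELECT expanded.{col_a}, expanded.{col_b},
               COALESCE(src.{col_c}, 0) AS {col_c}
        FROM (
            SELECT a.{col_a}, b.{col_b}
            FROM (SELECT DISTINCT {col_a} FROM {table}) AS a
            CROSS JOIN (SELECT DISTINCT {col_b} FROM {table}) AS b
        ) AS expanded
        LEFT JOIN {table} AS src
          ON expanded.{col_a} = src.{col_a}
         AND expanded.{col_b} = src.{col_b}
        ORDER BY expanded.{col_a}, expanded.{col_b}
    """
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (id INTEGER, name TEXT, rating INTEGER)")
# Made-up sample data: three existing ratings, the pair (2, "C") is absent.
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, "A", 5), (1, "C", 3), (2, "A", 4)],
)

grid = expand_grid_sqlite(conn, "ratings", "id", "name", "rating")
print(grid)
```

The three existing ratings come back unchanged, and the one pair that was absent from the table appears with a rating of 0.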


pyspark

pyspark-sql