Generate all possible combinations between two columns and an indicator to show if that combination exists in the source table
I am completely lost at one particular stage of the transformation.
I intend to implement it with either SQL or PySpark.
My input format is:
id name
1 A
1 C
1 E
2 A
2 B
2 C
2 E
2 F
3 A
3 E
3 D
Could you help me get to this output format?
id name rating
1 A 1
1 B 0
1 C 1
1 D 0
1 E 1
1 F 0
2 A 1
2 B 1
2 C 1
2 D 0
2 E 1
2 F 1
3 A 1
3 B 0
3 C 0
3 D 1
3 E 1
3 F 0
While the SQL query is still in progress, I just want to see whether I can achieve the same thing with PySpark and feed the dataset into ALS.
In other words: how do I generate all possible combinations between id and name, with the rating set to 1 if the combination exists in the table and 0 otherwise?
You need a combination of two derived tables and a CROSS JOIN to get all possible id and name combinations.
Query
SELECT
    *
FROM (
    SELECT
        *
    FROM (
        SELECT DISTINCT
            id
        FROM
            Table1
    ) AS distinct_id
    CROSS JOIN (
        SELECT DISTINCT
            name
        FROM
            Table1
    ) AS distinct_name
) AS table_combination
ORDER BY
    id ASC
    , name ASC
Result
| id | name |
|----|------|
| 1 | A |
| 1 | B |
| 1 | C |
| 1 | D |
| 1 | E |
| 1 | F |
| 2 | A |
| 2 | B |
| 2 | C |
| 2 | D |
| 2 | E |
| 2 | F |
| 3 | A |
| 3 | B |
| 3 | C |
| 3 | D |
| 3 | E |
| 3 | F |
See the demo: http://sqlfiddle.com/#!9/ba5f17/17
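If you would rather do this first step in the PySpark DataFrame API, the same cross join of the two distinct column sets can be sketched as follows (an illustrative sketch only, assuming the data is already loaded in a DataFrame named df with columns id and name):

# Distinct ids cross-joined with distinct names gives every possible pair.
ids = df.select("id").distinct()
names = df.select("name").distinct()
combinations = ids.crossJoin(names).orderBy("id", "name")
combinations.show()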
Now we can use a LEFT JOIN combined with CASE WHEN column IS NULL ... END to check whether a combination exists in the source table or was only generated.
Query
SELECT
    Table_combination.id
    , Table_combination.name
    , (
        CASE
            WHEN Table1.id IS NULL
            THEN 0
            ELSE 1
        END
    ) AS rating
FROM (
    SELECT
        *
    FROM (
        SELECT DISTINCT
            id
        FROM
            Table1
    ) AS distinct_id
    CROSS JOIN (
        SELECT DISTINCT
            name
        FROM
            Table1
    ) AS distinct_name
) AS Table_combination
LEFT JOIN
    Table1
ON
    Table_combination.id = Table1.id
    AND
    Table_combination.name = Table1.name
ORDER BY
    Table_combination.id ASC
    , Table_combination.name ASC
Result
| id | name | rating |
|----|------|--------|
| 1 | A | 1 |
| 1 | B | 0 |
| 1 | C | 1 |
| 1 | D | 0 |
| 1 | E | 1 |
| 1 | F | 0 |
| 2 | A | 1 |
| 2 | B | 1 |
| 2 | C | 1 |
| 2 | D | 0 |
| 2 | E | 1 |
| 2 | F | 1 |
| 3 | A | 1 |
| 3 | B | 0 |
| 3 | C | 0 |
| 3 | D | 1 |
| 3 | E | 1 |
| 3 | F | 0 |
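The same two steps also translate to the PySpark DataFrame API, which may be more convenient than SQL if the goal is to feed the result into ALS afterwards. A hedged sketch, assuming the input is a DataFrame named df with columns id and name as in the question:

from pyspark.sql import functions as F

# Every possible (id, name) pair.
ids = df.select("id").distinct()
names = df.select("name").distinct()
all_pairs = ids.crossJoin(names)

# Left join back to the source rows; pairs that exist get rating 1, the rest 0.
rated = (
    all_pairs
    .join(df.withColumn("rating", F.lit(1)), on=["id", "name"], how="left")
    .fillna(0, subset=["rating"])
    .orderBy("id", "name")
)
rated.show()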
I made a function based on Raymond Nijlands's answer:
def expand_grid(df, df_name, col_a, col_b, col_c):
    # Register the DataFrame as a temporary view so it can be queried with Spark SQL.
    df.createOrReplaceTempView(df_name)
    # Cross join the distinct values of col_a and col_b, then left join back to
    # the source view; combinations missing from the source get 0 for col_c.
    expand_sql = f"""
    SELECT
        expanded.{col_a},
        expanded.{col_b},
        CASE
            WHEN {df_name}.{col_c} IS NULL THEN 0
            ELSE {df_name}.{col_c}
        END AS {col_c}
    FROM (
        SELECT *
        FROM (
            SELECT DISTINCT {col_a}
            FROM {df_name}
        ) AS {col_a}s
        CROSS JOIN (
            SELECT DISTINCT {col_b}
            FROM {df_name}
        ) AS {col_b}s
    ) AS expanded
    LEFT JOIN {df_name}
        ON expanded.{col_a} = {df_name}.{col_a}
        AND expanded.{col_b} = {df_name}.{col_b}
    """
    print(expand_sql)
    result = spark.sql(expand_sql)
    return result
Usage in the context of this question:
expand_grid(df=df, df_name="df_name", col_a="id", col_b="name", col_c="rating")
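Since the question mentions feeding the dataset into ALS, a possible last step is sketched below. This is only an illustration, not part of the answers above: it assumes expanded is the id/name/rating DataFrame produced by one of the approaches in this thread, and the name_idx column name and the implicitPrefs setting are my own choices (ALS needs numeric user and item columns, so the string name column is indexed first).

from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS

# Map the string name column to a numeric index, since ALS requires numeric ids.
indexed = StringIndexer(inputCol="name", outputCol="name_idx").fit(expanded).transform(expanded)

# 0/1 ratings behave like implicit feedback, hence implicitPrefs=True.
als = ALS(userCol="id", itemCol="name_idx", ratingCol="rating", implicitPrefs=True)
model = als.fit(indexed)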