How to iterate over all columns in a table using a Snowflake JavaScript stored procedure or function?
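I have a table in Snowflake with 100+ columns. I'm trying to count how often each distinct value occurs in every column, and eventually combine the counts for all columns into a single table. If I were doing it for just one column, it would look like this: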
SELECT "AGE", COUNT(*) AS "Frequency"
FROM
db.schema.tablename
WHERE
"SURVEYDATE" < '2019-07-29'
GROUP BY
"AGE";
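I know this would be fairly trivial in Python (maybe I should just do it in PySpark; I'm open to suggestions), but to keep things easy for my team and faster on 300 million rows, I'd like to do something like this in Snowflake's JavaScript procedural language: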
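create or replace procedure column_counts(schema_name varchar, table_name varchar)
returns array
language javascript
as
$$
// Get the list of columns (names are stored in upper case unless they were quoted).
var cols = snowflake.createStatement({
  sqlText: 'SELECT column_name FROM db.information_schema.columns ' +
           'WHERE table_schema = ? AND table_name = ? ORDER BY ordinal_position',
  binds: [SCHEMA_NAME, TABLE_NAME]
}).execute();
var table = 'db."' + SCHEMA_NAME + '"."' + TABLE_NAME + '"';
var results_array = [];
while (cols.next()) {
  var col = cols.getColumnValue(1);
  // This returns all distinct values in that column and their counts.
  var rs = snowflake.createStatement({
    sqlText: 'SELECT TO_VARCHAR("' + col + '"), COUNT(*) AS "Frequency" FROM ' + table +
             ' WHERE "SURVEYDATE" < ? GROUP BY 1',
    binds: ['2019-07-29']
  }).execute();
  // I then want an array of [column_name, distinct_value, frequency] entries.
  while (rs.next()) {
    results_array.push([col, rs.getColumnValue(1), rs.getColumnValue(2)]);
  }
}
return results_array;
$$
;
CALL column_counts('SCHEMA', 'TABLENAME');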
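I'm quite new to this procedural language in Snowflake, and to Snowflake as a whole, so I'm definitely open to suggestions on how best to do this, in a way that can be repeated for a new table every month.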
It is also possible without any procedural code, for example by using JSON:
WITH cte AS ( -- here goes the table/query/view
SELECT TOP 100 OBJECT_CONSTRUCT(*) AS json_payload
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
)
SELECT f.KEY,
COUNT(DISTINCT f."VALUE") AS frequency,
LISTAGG(DISTINCT f."VALUE" ,',') AS distinct_values -- debug
FROM cte
, LATERAL FLATTEN (input => json_payload) f
-- WHERE f.KEY IN ('column_name1', 'column_name2', ...) -- only specific columns
GROUP BY f.KEY;
Output:
+-----------------+-----------+------------------------------------------------+
| KEY | FREQUENCY | DISTINCT_VALUES |
+-----------------+-----------+------------------------------------------------+
| O_ORDERPRIORITY | 5 | 2-HIGH,1-URGENT,5-LOW,4-NOT SPECIFIED,3-MEDIUM |
| O_ORDERSTATUS | 3 | P,O,F |
| O_SHIPPRIORITY | 1 | 0 |
| ... | ... | .... |
+-----------------+-----------+------------------------------------------------+
How it works:
1. Generate JSON for each row using OBJECT_CONSTRUCT(*)
2. Flatten the JSON into key/value pairs
3. Group by key and apply the aggregate function you need: COUNT/COUNT(DISTINCT )/LISTAGG/MIN/MAX/...
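To make the three steps more concrete, here is a rough plain-JavaScript model of the same pipeline running over a few in-memory rows. It is only an illustration, not how Snowflake executes the query: the sample rows and the helper names (objectConstruct, flatten, countDistinctByKey) are made up for this sketch. One behaviour worth noting is that OBJECT_CONSTRUCT leaves out key/value pairs whose value is NULL, so NULLs do not show up in the counts; the sketch mimics that.

```javascript
// Plain-JavaScript model of the OBJECT_CONSTRUCT(*) + FLATTEN + GROUP BY pipeline.
// The sample rows are invented for illustration only.
var rows = [
  { O_ORDERSTATUS: 'O', O_SHIPPRIORITY: 0, O_COMMENT: null },
  { O_ORDERSTATUS: 'F', O_SHIPPRIORITY: 0, O_COMMENT: 'late' },
  { O_ORDERSTATUS: 'O', O_SHIPPRIORITY: 0, O_COMMENT: null }
];

// Step 1: one object per row; like OBJECT_CONSTRUCT, drop keys whose value is NULL.
function objectConstruct(row) {
  var obj = {};
  Object.keys(row).forEach(function (key) {
    if (row[key] !== null && row[key] !== undefined) {
      obj[key] = row[key];
    }
  });
  return obj;
}

// Step 2: flatten every object into { key, value } pairs.
function flatten(objects) {
  var pairs = [];
  objects.forEach(function (obj) {
    Object.keys(obj).forEach(function (key) {
      pairs.push({ key: key, value: obj[key] });
    });
  });
  return pairs;
}

// Step 3: group by key and count the distinct values seen for each key.
function countDistinctByKey(pairs) {
  var seen = {};
  pairs.forEach(function (pair) {
    if (!seen[pair.key]) {
      seen[pair.key] = {};
    }
    seen[pair.key][String(pair.value)] = true;
  });
  var result = {};
  Object.keys(seen).forEach(function (key) {
    result[key] = Object.keys(seen[key]).length;
  });
  return result;
}

var distinctCounts = countDistinctByKey(flatten(rows.map(objectConstruct)));
console.log(distinctCounts);
```

If NULLs need to be counted as a value of their own, they would have to be replaced with a marker value before the objects are built.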
A version that gives the distribution per column/value:
WITH cte AS (
SELECT TOP 100 OBJECT_CONSTRUCT(*) AS json_payload
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
)
SELECT f.KEY, f."VALUE", COUNT(*) AS frequency
FROM cte
, LATERAL FLATTEN (input => json_payload) f
-- WHERE f.KEY IN ('column_name1', 'column_name2', ...) -- only specific columns
GROUP BY f.KEY, f."VALUE"
ORDER BY f.KEY, f."VALUE";
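If the query still needs to live in a JavaScript stored procedure (for example to rerun it on a new table each month), one possible shape is sketched below. Everything here is an assumption layered on the answer: columnCounts and fakeSnowflake are hypothetical helpers, the table name is passed through IDENTIFIER(?) as a bind, the "SURVEYDATE" filter is borrowed from the question, and values are cast to VARCHAR so they serialize uniformly. The `sf` parameter stands in for the built-in `snowflake` object, which lets the function be tried locally against a stub.

```javascript
// Sketch: run the per-column/value frequency query through the stored-procedure API.
// `sf` stands for Snowflake's built-in `snowflake` object; columnCounts is hypothetical.
function columnCounts(sf, tableName, cutoffDate) {
  var sqlText =
    'WITH cte AS (' +
    ' SELECT OBJECT_CONSTRUCT(*) AS json_payload' +
    ' FROM IDENTIFIER(?)' +
    ' WHERE "SURVEYDATE" < ?' +
    ') ' +
    'SELECT f.KEY, f."VALUE"::VARCHAR, COUNT(*) AS frequency ' +
    'FROM cte, LATERAL FLATTEN(input => json_payload) f ' +
    'GROUP BY 1, 2 ORDER BY 1, 2';
  var rs = sf.createStatement({ sqlText: sqlText, binds: [tableName, cutoffDate] }).execute();
  var results = [];
  while (rs.next()) {
    // One entry per [column_name, distinct_value, frequency].
    results.push([rs.getColumnValue(1), rs.getColumnValue(2), rs.getColumnValue(3)]);
  }
  return results;
}

// Local stand-in for the `snowflake` object so the sketch can run outside Snowflake.
// It replays canned rows and records the statement it was given.
function fakeSnowflake(rows) {
  var captured = {};
  return {
    captured: captured,
    createStatement: function (options) {
      captured.sqlText = options.sqlText;
      captured.binds = options.binds;
      return {
        execute: function () {
          var index = -1;
          return {
            next: function () { index += 1; return index < rows.length; },
            getColumnValue: function (position) { return rows[index][position - 1]; }
          };
        }
      };
    }
  };
}

var fake = fakeSnowflake([['AGE', '30', 2], ['AGE', '41', 1]]);
var counts = columnCounts(fake, 'db.schema.tablename', '2019-07-29');
console.log(counts);
```

Inside Snowflake, the function would go between the $$ markers of a procedure declared with, say, (table_name varchar, cutoff_date varchar), followed by return columnCounts(snowflake, TABLE_NAME, CUTOFF_DATE); because argument names are referenced in upper case in the JavaScript body. On 300 million rows the result may be too large to return as an array, in which case writing it to a table with INSERT ... SELECT is probably the safer choice.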