如何计算 bigquery 数组字段中元素的频率
How to count frequency of elements in a bigquery array field
我有一个 table 看起来像这样:
我正在寻找 table,它给出字段 l_0, l_1, l_2, l_3
.
中元素的频率计数
例如,输出应如下所示:
| author_id | year | l_o.name | l_0.count| l1.name | l1.count | l2.name | l2.count| l3.name | l3.count|
| 2164089123 | 1987 | biology | 3 | botany | 3 | | | | |
| 2595831531 | 1987 | computer science | 2 | simulation | 2 | computer simulation | 2 | mathematical model | 2 |
编辑:
在某些情况下,数组字段可能有不止一种类型的元素。例如 l_0
可能是 ['biology', 'biology', 'geometry', 'geometry']
。在这种情况下,字段 l_0, l_1, l_2, l_3
的输出将是一个嵌套的重复字段,其中包含 l_0.name
中的所有元素以及 l_0.count
.
中的所有相应计数
这应该可以工作,假设您想在每个数组的基础上进行计数:
SELECT
author_id,
year,
(SELECT AS STRUCT ANY_VALUE(l_0) AS name, COUNT(*) AS count
FROM UNNEST(l_0) AS l_0) AS l_0,
(SELECT AS STRUCT ANY_VALUE(l_1) AS name, COUNT(*) AS count
FROM UNNEST(l_1) AS l_1) AS l_1,
(SELECT AS STRUCT ANY_VALUE(l_2) AS name, COUNT(*) AS count
FROM UNNEST(l_2) AS l_2) AS l_2,
(SELECT AS STRUCT ANY_VALUE(l_3) AS name, COUNT(*) AS count
FROM UNNEST(l_3) AS l_3) AS l_3
FROM YourTable;
为了避免太多重复,您可以使用 SQL UDF:
CREATE TEMP FUNCTION GetNameAndCount(elements ARRAY<STRING>) AS (
(SELECT AS STRUCT ANY_VALUE(elem) AS name, COUNT(*) AS count
FROM UNNEST(elements) AS elem)
);
SELECT
author_id,
year,
GetNameAndCount(l_0) AS l_0,
GetNameAndCount(l_1) AS l_1,
GetNameAndCount(l_2) AS l_2,
GetNameAndCount(l_3) AS l_3
FROM YourTable;
如果您可能需要考虑一个数组中的多个不同名称,您可以使用 UDF return 一个包含相关计数的名称数组:
CREATE TEMP FUNCTION GetNamesAndCounts(elements ARRAY<STRING>) AS (
ARRAY(
SELECT AS STRUCT elem AS name, COUNT(*) AS count
FROM UNNEST(elements) AS elem
GROUP BY elem
ORDER BY count
)
);
SELECT
author_id,
year,
GetNamesAndCounts(l_0) AS l_0,
GetNamesAndCounts(l_1) AS l_1,
GetNamesAndCounts(l_2) AS l_2,
GetNamesAndCounts(l_3) AS l_3
FROM YourTable;
请注意,如果要跨行计数,则需要取消嵌套数组并在外层执行 GROUP BY
,但看起来这不是您的意图基于问题。
我有一个 table 看起来像这样:
我正在寻找 table,它给出字段 l_0, l_1, l_2, l_3
.
例如,输出应如下所示:
| author_id | year | l_o.name | l_0.count| l1.name | l1.count | l2.name | l2.count| l3.name | l3.count|
| 2164089123 | 1987 | biology | 3 | botany | 3 | | | | |
| 2595831531 | 1987 | computer science | 2 | simulation | 2 | computer simulation | 2 | mathematical model | 2 |
编辑:
在某些情况下,数组字段可能有不止一种类型的元素。例如 l_0
可能是 ['biology', 'biology', 'geometry', 'geometry']
。在这种情况下,字段 l_0, l_1, l_2, l_3
的输出将是一个嵌套的重复字段,其中包含 l_0.name
中的所有元素以及 l_0.count
.
这应该可以工作,假设您想在每个数组的基础上进行计数:
SELECT
author_id,
year,
(SELECT AS STRUCT ANY_VALUE(l_0) AS name, COUNT(*) AS count
FROM UNNEST(l_0) AS l_0) AS l_0,
(SELECT AS STRUCT ANY_VALUE(l_1) AS name, COUNT(*) AS count
FROM UNNEST(l_1) AS l_1) AS l_1,
(SELECT AS STRUCT ANY_VALUE(l_2) AS name, COUNT(*) AS count
FROM UNNEST(l_2) AS l_2) AS l_2,
(SELECT AS STRUCT ANY_VALUE(l_3) AS name, COUNT(*) AS count
FROM UNNEST(l_3) AS l_3) AS l_3
FROM YourTable;
为了避免太多重复,您可以使用 SQL UDF:
CREATE TEMP FUNCTION GetNameAndCount(elements ARRAY<STRING>) AS (
(SELECT AS STRUCT ANY_VALUE(elem) AS name, COUNT(*) AS count
FROM UNNEST(elements) AS elem)
);
SELECT
author_id,
year,
GetNameAndCount(l_0) AS l_0,
GetNameAndCount(l_1) AS l_1,
GetNameAndCount(l_2) AS l_2,
GetNameAndCount(l_3) AS l_3
FROM YourTable;
如果您可能需要考虑一个数组中的多个不同名称,您可以使用 UDF return 一个包含相关计数的名称数组:
CREATE TEMP FUNCTION GetNamesAndCounts(elements ARRAY<STRING>) AS (
ARRAY(
SELECT AS STRUCT elem AS name, COUNT(*) AS count
FROM UNNEST(elements) AS elem
GROUP BY elem
ORDER BY count
)
);
SELECT
author_id,
year,
GetNamesAndCounts(l_0) AS l_0,
GetNamesAndCounts(l_1) AS l_1,
GetNamesAndCounts(l_2) AS l_2,
GetNamesAndCounts(l_3) AS l_3
FROM YourTable;
请注意,如果要跨行计数,则需要取消嵌套数组并在外层执行 GROUP BY
,但看起来这不是您的意图基于问题。