Efficient way of marking duplicates (except first) based on a subset of columns in BigQuery
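So I have a dataset that looks like this:

| user_ID | order_ID | order_start_date | is_returning? |
|---|---|---|---|
| 1 | 1234 | 23-Mar-2021 | 0 |
| 2 | 1235 | 23-Mar-2021 | 0 |
| 2 | 1236 | 23-Mar-2021 | 1 |
| 1 | 1237 | 24-Mar-2021 | 1 |
| 3 | 1238 | 28-Mar-2021 | 0 |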
I have every column except is_returning, which I want to compute based on whether the user_ID has been seen before.
In pandas, it's a simple one-liner:
all_data['is_returning'] = all_data.user_id.duplicated().astype(int)
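For reference, here is a minimal pure-Python sketch of what that one-liner computes on the sample data above, assuming the default `keep='first'` behaviour of `duplicated()`. The helper name `mark_returning` is made up for this illustration and is not part of pandas.

```python
def mark_returning(user_ids):
    """Return 1 for every user_id that already appeared earlier in the list, else 0."""
    seen = set()
    flags = []
    for uid in user_ids:
        flags.append(1 if uid in seen else 0)
        seen.add(uid)
    return flags


# user_ID column from the sample table, in row order
user_ids = [1, 2, 2, 1, 3]
flags = mark_returning(user_ids)
print(flags)
```

The first occurrence of each user_ID gets 0 and every later occurrence gets 1, which is the behaviour the BigQuery query needs to reproduce.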
But in BQ, I haven't found a straightforward way to do it. So far, I've got:
(SELECT COUNT(USER_ID) AS num_users_with_2_orders
 FROM (
   SELECT COUNT(DISTINCT ORDER_NUMBER) AS ORDERS, USER_ID
   FROM `my_project.my_dataset.my_table`
   WHERE order_number IN (
       SELECT DISTINCT ORDER_NUMBER
       FROM (
         SELECT order_number, SUM(usd_total_price) AS totals
         FROM `my_project.my_dataset.my_table`
         GROUP BY order_number
         HAVING totals <= @maximum_value AND totals >= @minimum_value))
     AND USER_ID IN (
       SELECT DISTINCT USER_ID
       FROM `my_project.my_dataset.my_table`
       WHERE USER_ID <> -1
         AND CAST(order_start_date AS DATE) >= PARSE_DATE('%Y%m%d', @DS_START_DATE)
         AND CAST(order_start_date AS DATE) <= PARSE_DATE('%Y%m%d', @DS_END_DATE))
   GROUP BY USER_ID
   HAVING ORDERS >= 2))
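Then I compare against these user_IDs: if an order's user_ID from the original table is in that set, I mark the order as returning. Besides being overly complicated, this does not reproduce the behaviour of the keep='first' parameter, because it marks all of a user's orders without excluding the first one. To fix that, I would have to add another condition, order_start_date <> MIN(order_start_date), grouped by user_ID.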
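One caveat with that extra condition, sketched in plain Python: comparing each order's date against the user's earliest date does not flag a second order placed on the same day as the first. The rows are the sample table from the question; `flag_by_min_date` is a hypothetical helper written only for this illustration.

```python
from datetime import date

# (user_id, order_id, order_start_date) rows from the sample table
rows = [
    (1, 1234, date(2021, 3, 23)),
    (2, 1235, date(2021, 3, 23)),
    (2, 1236, date(2021, 3, 23)),
    (1, 1237, date(2021, 3, 24)),
    (3, 1238, date(2021, 3, 28)),
]


def flag_by_min_date(rows):
    """Flag an order as returning when its date differs from the user's earliest date."""
    first_date = {}
    for user_id, _, start in rows:
        if user_id not in first_date or start < first_date[user_id]:
            first_date[user_id] = start
    return {order_id: int(start != first_date[user_id]) for user_id, order_id, start in rows}


flags = flag_by_min_date(rows)
print(flags)
```

Order 1236 comes out as 0 here although the table expects 1, because both of user 2's orders share a date. A date-only comparison would therefore need a tie-breaker such as order_ID.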
So my question is: what is a more efficient way to achieve the same thing?
In BigQuery, this can be done with an analytic (window) function such as ROW_NUMBER:
with my_table as (
select 1 as user_id, 1234 as order_id union all
select 2, 1235 union all
select 2, 1236 union all
select 1, 1237 union all
select 3, 1238
)
select
user_id,
order_id,
IF(ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_id) > 1, 1, 0) AS is_returning
from my_table
order by order_id
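If order_start_date should decide which order counts as a user's first, the same idea can be extended by ordering on the date and breaking ties with order_id. The sketch below is an illustration under stated assumptions rather than BigQuery code: it runs the query through Python's built-in `sqlite3` module (window functions need SQLite 3.25 or newer), stores the dates as ISO strings so they sort chronologically, and uses `CASE WHEN`, which both SQLite and BigQuery accept, in place of `IF`.

```python
import sqlite3

# In-memory SQLite database standing in for the BigQuery table (assumption for this sketch)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (user_id INTEGER, order_id INTEGER, order_start_date TEXT)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        (1, 1234, "2021-03-23"),
        (2, 1235, "2021-03-23"),
        (2, 1236, "2021-03-23"),
        (1, 1237, "2021-03-24"),
        (3, 1238, "2021-03-28"),
    ],
)

# Number each user's orders by date, then order_id; every row after the first is "returning"
query = """
SELECT
  user_id,
  order_id,
  CASE
    WHEN ROW_NUMBER() OVER (
           PARTITION BY user_id
           ORDER BY order_start_date, order_id
         ) > 1 THEN 1
    ELSE 0
  END AS is_returning
FROM my_table
ORDER BY order_id
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
conn.close()
```

Ordering by order_id alone, as in the answer, gives the same flags whenever order IDs increase over time. Adding the date only matters if that assumption does not hold for the real table.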