Efficient way of marking duplicates (except first) based on a subset of columns in BigQuery

I have a dataset that looks like this:

user_ID  order_ID  order_start_date  is_returning?
1        1234      23-Mar-2021       0
2        1235      23-Mar-2021       0
2        1236      23-Mar-2021       1
1        1237      24-Mar-2021       1
3        1238      28-Mar-2021       0

I have every column except is_returning, which I want to compute based on whether the user_ID has been seen before.

In pandas, it's a simple one-liner:

all_data['is_returning'] = all_data.user_id.duplicated().astype(int) 

But in BQ I haven't found a straightforward way to do it. So far, I have this:

(SELECT COUNT(USER_ID) AS num_users_with_2_orders
 FROM (
   SELECT COUNT(DISTINCT ORDER_NUMBER) AS ORDERS, USER_ID
   FROM `my_project.my_dataset.my_table`
   WHERE order_number IN (
       SELECT DISTINCT ORDER_NUMBER
       FROM (
         SELECT order_number, sum(usd_total_price) as totals
         FROM `my_project.my_dataset.my_table`
         GROUP BY order_number
         HAVING totals <= @maximum_value AND totals >= @minimum_value))
     AND USER_ID IN (
       SELECT DISTINCT USER_ID
       FROM `my_project.my_dataset.my_table`
       WHERE USER_ID <> -1
         AND CAST(order_start_date AS DATE) >= PARSE_DATE('%Y%m%d', @DS_START_DATE)
         AND CAST(order_start_date AS DATE) <= PARSE_DATE('%Y%m%d', @DS_END_DATE))
   GROUP BY USER_ID
   HAVING ORDERS >= 2))

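I then use these user_IDs for comparison: if a user_ID from the original table is among them, I mark the order as returning. Besides being overly complicated, this does not do what the keep='first' parameter does, because it marks all of a user's orders and does not exclude the first one. To get that, I would have to apply an additional condition, order_start_date <> MIN(ORDER_START_DATE), grouped by user_ID.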
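To see what that extra MIN(ORDER_START_DATE) condition would do on the sample rows, here is a minimal local sketch. It is not BigQuery: it assumes Python's built-in sqlite3 module as a stand-in engine, a hypothetical table named orders, and dates rewritten as ISO strings so that text comparison matches date order.

```python
import sqlite3

# Sample rows from the question: (user_id, order_id, order_start_date).
# Dates are rewritten as ISO strings (an assumption) so they compare correctly as text.
ROWS = [
    (1, 1234, "2021-03-23"),
    (2, 1235, "2021-03-23"),
    (2, 1236, "2021-03-23"),
    (1, 1237, "2021-03-24"),
    (3, 1238, "2021-03-28"),
]

# Flag an order as returning when its date differs from the user's earliest order date.
QUERY = """
SELECT o.order_id,
       CASE WHEN o.order_start_date <> f.first_date THEN 1 ELSE 0 END AS is_returning
FROM orders AS o
JOIN (SELECT user_id, MIN(order_start_date) AS first_date
      FROM orders
      GROUP BY user_id) AS f
  ON f.user_id = o.user_id
ORDER BY o.order_id
"""


def flag_by_min_date(rows):
    """Load the rows into an in-memory SQLite table and run the MIN-date comparison."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (user_id INTEGER, order_id INTEGER, order_start_date TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for order_id, is_returning in flag_by_min_date(ROWS):
        print(order_id, is_returning)
```

On this data the sketch flags only order 1237. Order 1236 stays at 0 because it shares its date with the user's first order, so a date-only comparison cannot reproduce the expected is_returning column when a user places two orders on the same day.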

So my question is: what is a more efficient way to achieve the same result?

In BigQuery, this can be done with an analytic function such as row_number:

with my_table as (
  select 1 as user_id, 1234 as order_id union all
  select 2, 1235 union all
  select 2, 1236 union all
  select 1, 1237 union all
  select 3, 1238
)
select 
  user_id,
  order_id,
  IF(ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_id) > 1, 1, 0) AS is_returning
from my_table
order by order_id
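
The query above orders each user's rows by order_id. If "first" should instead mean the earliest order_start_date, as in the question, one possible variant orders the window by date and uses order_id only as a tie-breaker. The sketch below checks that logic locally rather than in BigQuery: it assumes Python's built-in sqlite3 module (SQLite 3.25 or newer, for window functions), a hypothetical table named orders, ISO-formatted date strings, and CASE WHEN in place of BigQuery's IF.

```python
import sqlite3

# Sample rows from the question: (user_id, order_id, order_start_date).
# Dates are rewritten as ISO strings (an assumption) so they sort correctly as text.
ROWS = [
    (1, 1234, "2021-03-23"),
    (2, 1235, "2021-03-23"),
    (2, 1236, "2021-03-23"),
    (1, 1237, "2021-03-24"),
    (3, 1238, "2021-03-28"),
]

# Number each user's orders by date, then by order_id to break same-day ties;
# every row after the first one in its partition is flagged as returning.
QUERY = """
SELECT user_id,
       order_id,
       CASE WHEN ROW_NUMBER() OVER (
                PARTITION BY user_id
                ORDER BY order_start_date, order_id
            ) > 1 THEN 1 ELSE 0 END AS is_returning
FROM orders
ORDER BY order_id
"""


def flag_returning(rows):
    """Load the rows into an in-memory SQLite table and run the window query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (user_id INTEGER, order_id INTEGER, order_start_date TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for user_id, order_id, is_returning in flag_returning(ROWS):
        print(user_id, order_id, is_returning)
```

On the sample rows this yields 0, 0, 1, 1, 0 for orders 1234 through 1238, matching the is_returning column in the question. The tie-breaker matters for user 2, whose two orders share a date: without it, which of the two rows gets row number 1 would not be deterministic.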