BigQuery - Remove specific duplicate records
I have a BigQuery table containing data like this:
date hits_eventInfo_Category hits_eventInfo_Action session_id user_id hits_time hits_eventInfo_Label
20151021 Air Search 1445001 A232 1952 CurrentLocation
20151021 Air Search 1445001 A232 1952 CurrentLocation
20151021 Air Search 1445001 A232 1952 CurrentLocation
20151021 Air Select 1445001 A232 7380 Vendor
20151021 Air Select 1445001 A232 7380 Vendor
20151021 Air Select 1445001 A232 7380 Vendor
As you can see, there are groups of duplicate records. I want to end up with exactly one record from each set of duplicates. For example:
date hits_eventInfo_Category hits_eventInfo_Action session_id user_id hits_time hits_eventInfo_Label
20151021 Air Search 1445001 A232 1952 CurrentLocation
20151021 Air Select 1445001 A232 7380 Vendor
How can I do this?
Thanks in advance!
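You can group the duplicates, keep one row, and delete the remaining rows from each duplicate group.
Try this (I am assuming the table name and the other fields):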
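-- BigQuery standard SQL cannot DELETE through a CTE, so this rewrites the table
-- and keeps only row 1 of each duplicate group.
-- BigQuery_table is a placeholder; use your own dataset-qualified table name.
CREATE OR REPLACE TABLE BigQuery_table AS
SELECT * EXCEPT (dup)
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY `date`, hits_eventInfo_Category, hits_eventInfo_Action, session_id, user_id, hits_time, hits_eventInfo_Label) AS dup
  FROM BigQuery_table
)
WHERE dup = 1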
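To see the "number the rows in each group, keep row 1" idea run end to end without a BigQuery project, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in. The table name `events` and the helper `dedupe_with_row_number` are made up for illustration, and it assumes your SQLite build supports window functions (3.25 or later). The query only selects the surviving rows; it is not BigQuery's own syntax for rewriting a table.

```python
import sqlite3

# Sample data copied from the question: two distinct events, each stored three times.
ROWS = [
    ("20151021", "Air", "Search", 1445001, "A232", 1952, "CurrentLocation"),
] * 3 + [
    ("20151021", "Air", "Select", 1445001, "A232", 7380, "Vendor"),
] * 3

COLUMNS = (
    "date, hits_eventInfo_Category, hits_eventInfo_Action, "
    "session_id, user_id, hits_time, hits_eventInfo_Label"
)


def dedupe_with_row_number(rows):
    """Return one row per group of identical rows (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE events ({COLUMNS})")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    # Partition by every column so identical rows land in the same group,
    # number the rows inside each group, and keep only the first one.
    query = f"""
        SELECT {COLUMNS}
        FROM (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY {COLUMNS}) AS dup
            FROM events
        )
        WHERE dup = 1
        ORDER BY hits_time
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in dedupe_with_row_number(ROWS):
        print(row)
```

Because the rows in a group are identical in every column, it does not matter which one gets number 1, which is why no ORDER BY is needed inside the window.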
You can use the DISTINCT clause, or you can group the data. Either one collapses the returned data to a single row per unique entry.
-- `BigQuery` is a placeholder for your own dataset-qualified table name.
SELECT DISTINCT `date`, hits_eventInfo_Category, hits_eventInfo_Action, session_id, user_id, hits_time, hits_eventInfo_Label
FROM `BigQuery`
--OR
SELECT `date`, hits_eventInfo_Category, hits_eventInfo_Action, session_id, user_id, hits_time, hits_eventInfo_Label
FROM `BigQuery`
GROUP BY `date`, hits_eventInfo_Category, hits_eventInfo_Action, session_id, user_id, hits_time, hits_eventInfo_Label
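Note: this does not delete your duplicate data; the duplicates simply do not appear in the result of your SELECT statement. If you want to remove the duplicate entries permanently, use @singhsac's answer, which relies on a window function.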
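The difference between hiding duplicates in a result and actually removing them can be checked locally. The sketch below again uses Python's sqlite3 module as a stand-in for BigQuery; the table names `events` and `events_dedup` and the helper `distinct_vs_materialized` are assumptions made for illustration. It runs SELECT DISTINCT, confirms the source table still holds every row, and then writes the distinct rows into a second table, which is one way to make the result permanent.

```python
import sqlite3

# Sample data copied from the question: two distinct events, each stored three times.
ROWS = [
    ("20151021", "Air", "Search", 1445001, "A232", 1952, "CurrentLocation"),
] * 3 + [
    ("20151021", "Air", "Select", 1445001, "A232", 7380, "Vendor"),
] * 3


def distinct_vs_materialized(rows):
    """Return (distinct rows returned, rows left in source, rows in the new table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (date, hits_eventInfo_Category, hits_eventInfo_Action, "
        "session_id, user_id, hits_time, hits_eventInfo_Label)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

    # DISTINCT only shapes the query result; the stored rows are untouched.
    distinct_rows = conn.execute("SELECT DISTINCT * FROM events").fetchall()
    still_stored = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]

    # Writing the distinct rows to a new table is what makes the cleanup permanent.
    conn.execute("CREATE TABLE events_dedup AS SELECT DISTINCT * FROM events")
    persisted = conn.execute("SELECT COUNT(*) FROM events_dedup").fetchone()[0]

    conn.close()
    return len(distinct_rows), still_stored, persisted


if __name__ == "__main__":
    print(distinct_vs_materialized(ROWS))
```

With the sample rows from the question, the query returns two rows while the source table still holds all six; only the newly created table contains the deduplicated data.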