MySQL: Exclude duplicate scan logs within the same day

I am trying to select rows while excluding duplicates within a single day. The criterion for a duplicate is: SAME USER AND SAME PRODUCT_UPC AND SAME DATE(SCANNED_ON)

So, from the table below, if SCAN_ID = 100 is selected, SCAN_ID = 101 should be excluded, because both rows belong to the same user_id AND the same product_upc AND have the same DATE(scanned_on).

Here is the table structure:

SCAN_ID      USER_ID      PRODUCT_UPC      SCANNED_ON
100          1            0767914767       2020-08-01 03:49:11
101          1            0767914767       2020-08-01 03:58:28
102          2            0064432050       2020-08-02 04:01:31
103          3            0804169977       2020-08-10 04:08:48
104          4            0875523846       2020-08-10 05:21:32
105          4            0007850492       2020-08-12 07:10:05

The query I have come up with so far is:

SET @last_user='', @last_upc='', @last_date='';
SELECT *,
@last_user as last_user , @last_user:=user_id as this_user,
@last_upc as last_upc , @last_upc:=product_upc as this_upc,
@last_date as last_date , @last_date:=DATE(scanned_on) as this_date
FROM scansv2
HAVING this_user != last_user OR this_upc != last_upc OR this_date != last_date

In MySQL 8 you can use ROW_NUMBER():

CREATE TABLE scansv2 (
  `SCAN_ID` INTEGER,
  `USER_ID` INTEGER,
  `PRODUCT_UPC` INTEGER,
  `SCANNED_ON` DATETIME
);

INSERT INTO scansv2
  (`SCAN_ID`, `USER_ID`, `PRODUCT_UPC`, `SCANNED_ON`)
VALUES
  ('100', '1', '0767914767', '2020-08-01 03:49:11'),
  ('101', '1', '0767914767', '2020-08-01 03:58:28'),
  ('102', '2', '0064432050', '2020-08-02 04:01:31'),
  ('103', '3', '0804169977', '2020-08-10 04:08:48'),
  ('104', '4', '0875523846', '2020-08-10 05:21:32'),
  ('105', '4', '0007850492', '2020-08-12 07:10:05');

-- Number the rows of each PRODUCT_UPC, newest scan first, then keep row 1
WITH rownum AS (
    SELECT `SCAN_ID`, `USER_ID`, `PRODUCT_UPC`, `SCANNED_ON`,
           ROW_NUMBER() OVER (
               PARTITION BY `PRODUCT_UPC`
               ORDER BY `SCANNED_ON` DESC
           ) AS row_num
    FROM scansv2
)
SELECT `SCAN_ID`, `USER_ID`, `PRODUCT_UPC`, `SCANNED_ON`
FROM rownum
WHERE row_num = 1
ORDER BY `SCAN_ID`;

SCAN_ID | USER_ID | PRODUCT_UPC | SCANNED_ON         
------: | ------: | ----------: | :------------------
    101 |       1 |   767914767 | 2020-08-01 03:58:28
    102 |       2 |    64432050 | 2020-08-02 04:01:31
    103 |       3 |   804169977 | 2020-08-10 04:08:48
    104 |       4 |   875523846 | 2020-08-10 05:21:32
    105 |       4 |     7850492 | 2020-08-12 07:10:05
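
Note that the query above partitions on `PRODUCT_UPC` alone and orders newest first, so it keeps SCAN_ID 101 rather than 100. As a minimal sketch of how the question's full criterion (same user, same UPC, same calendar day) could be expressed with the same technique, assuming the `scansv2` table defined above; the CTE name `ranked` and the `SCAN_ID` tie-breaker are illustrative choices, not part of the original answer:

```sql
-- Sketch: one row per (user, product, calendar day), keeping the earliest scan.
WITH ranked AS (
    SELECT SCAN_ID, USER_ID, PRODUCT_UPC, SCANNED_ON,
           ROW_NUMBER() OVER (
               PARTITION BY USER_ID, PRODUCT_UPC, DATE(SCANNED_ON)
               ORDER BY SCANNED_ON, SCAN_ID   -- SCAN_ID breaks ties on identical timestamps
           ) AS row_num
    FROM scansv2
)
SELECT SCAN_ID, USER_ID, PRODUCT_UPC, SCANNED_ON
FROM ranked
WHERE row_num = 1
ORDER BY SCAN_ID;
```

On the sample data this keeps SCAN_ID 100 and drops 101, since both rows share user 1, the same UPC, and the date 2020-08-01.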


In MySQL 5.x, you need user-defined variables for the same purpose:

SELECT `SCAN_ID`, `USER_ID`, `PRODUCT_UPC`, `SCANNED_ON`
FROM (
    SELECT `SCAN_ID`, `USER_ID`, `SCANNED_ON`,
           -- restart the counter whenever PRODUCT_UPC changes
           IF(@product = `PRODUCT_UPC`, @row_num := @row_num + 1, @row_num := 1) AS row_num,
           @product := `PRODUCT_UPC` AS PRODUCT_UPC
    FROM (SELECT * FROM scansv2 ORDER BY `PRODUCT_UPC`, `SCANNED_ON`) c,
         (SELECT @row_num := 0, @product := 0) a
) b
WHERE row_num = 1
ORDER BY `SCAN_ID`;

SCAN_ID | USER_ID | PRODUCT_UPC | SCANNED_ON         
------: | ------: | ----------: | :------------------
    100 |       1 |   767914767 | 2020-08-01 03:49:11
    102 |       2 |    64432050 | 2020-08-02 04:01:31
    103 |       3 |   804169977 | 2020-08-10 04:08:48
    104 |       4 |   875523846 | 2020-08-10 05:21:32
    105 |       4 |     7850492 | 2020-08-12 07:10:05
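
The user-variable query above also groups on `PRODUCT_UPC` only, and it depends on the order in which MySQL evaluates the variable assignments, which the server does not guarantee. A possible alternative for MySQL 5.x that avoids variables is sketched below. It assumes that `SCAN_ID` increases with scan time, so that the smallest `SCAN_ID` in a group is the first scan of the day; that assumption is not stated in the question. The alias `first_scan` is illustrative.

```sql
-- Sketch: pick the smallest SCAN_ID per (user, product, calendar day),
-- then join back to fetch the full rows.
SELECT s.SCAN_ID, s.USER_ID, s.PRODUCT_UPC, s.SCANNED_ON
FROM scansv2 s
JOIN (
    SELECT MIN(SCAN_ID) AS SCAN_ID
    FROM scansv2
    GROUP BY USER_ID, PRODUCT_UPC, DATE(SCANNED_ON)
) first_scan ON first_scan.SCAN_ID = s.SCAN_ID
ORDER BY s.SCAN_ID;
```

If `SCAN_ID` order cannot be trusted, the derived table would need `MIN(SCANNED_ON)` instead, joined back on user, UPC and timestamp.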


In most databases (including MySQL versions before 8.0), filtering with a subquery is a supported and efficient option:

select s.*
from scansv2 s
where s.scanned_on = (
    select min(s1.scanned_on)
    from scansv2 s1
    where 
        s1.user_id = s.user_id 
        and s1.product_upc = s.product_upc
        and s1.scanned_on >= date(s.scanned_on)
        and s1.scanned_on <  date(s.scanned_on) + interval 1 day
)

This gives you the first row per user_id, product_upc and day, and filters out the other rows (if any).
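
How efficient the correlated subquery is depends largely on indexing. As a sketch, assuming no suitable index exists yet, a composite index covering the two equality columns followed by the range column may let each subquery lookup be answered from the index; the index name `idx_scans_user_upc_time` is made up for illustration:

```sql
-- Sketch: equality columns first, range column (scanned_on) last.
CREATE INDEX idx_scans_user_upc_time
    ON scansv2 (user_id, product_upc, scanned_on);

-- Inspect the plan to see whether the optimizer actually picks the index.
EXPLAIN
SELECT s.*
FROM scansv2 s
WHERE s.scanned_on = (
    SELECT MIN(s1.scanned_on)
    FROM scansv2 s1
    WHERE s1.user_id = s.user_id
      AND s1.product_upc = s.product_upc
      AND s1.scanned_on >= DATE(s.scanned_on)
      AND s1.scanned_on <  DATE(s.scanned_on) + INTERVAL 1 DAY
);
```

Whether the index is used will vary with the MySQL version and the data distribution, so the `EXPLAIN` output is worth checking on real data.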