Remove duplicates based on timestamp in a Kusto query
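I have two tables in Kusto, like the ones below. I'm trying to join them on name/username while keeping the rows of the second table even when the first table has no match. I also want to remove duplicates from the second table based on timestamp when the username and email are the same (in that case I'd keep the most recent information, i.e. the latest timestamp).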
Table 1
Name | pets | color | city
A | A1 | blue | NYC
A | A2 | blue | NYC
A | A3 | blue | NYC
B | B1 | red | Boston
C | C1 | yellow| Miami
C | C2 | yellow| Miami
Table 2
username | email | school | timestamp
A | a@whatever.com | schoolA | 10pm
B | b@whatever.com | schoolB1 | 10pm
B | b@whatever.com | schoolB2 | 11pm
C | c@whatever.com | schoolC | 9pm
D | d@whatever.com | schoolD | 11pm
E | e@whatever.com | schoolE | 10pm
Table results I want
name | pets | color | city | email | school | timestamp
A | A1 | blue | NYC | a@whatever.com | schoolA | 10pm
A | A2 | blue | NYC | a@whatever.com | schoolA | 10pm
A | A3 | blue | NYC | a@whatever.com | schoolA | 10pm
B | B1 | red | Boston| b@whatever.com | schoolB2 | 11pm
C | C1 | yellow | Miami | c@whatever.com | schoolC | 9pm
C | C2 | yellow | Miami | c@whatever.com | schoolC | 9pm
D | | | | d@whatever.com | schoolD | 11pm
E | | | | e@whatever.com | schoolE | 10pm
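If I understand correctly, the query below should work.
It uses:
- arg_max() (aggregation function): "remove duplicates from the second table based on timestamp if the username and email are the same (in that case I'd keep the most recent information, i.e. the latest timestamp)"
- Right outer-join flavor: "keep the rows of the second table even when the first table has no match"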
let T1 = datatable(name:string, pets:string, color:string, city:string)
[
    "A", "A1", "blue",   "NYC",
    "A", "A2", "blue",   "NYC",
    "A", "A3", "blue",   "NYC",
    "B", "B1", "red",    "Boston",
    "C", "C1", "yellow", "Miami",
    "C", "C2", "yellow", "Miami",
]
;
let T2 = datatable(username:string, email:string, school:string, timestamp:datetime)
[
    "A", "a@whatever.com", "schoolA",  datetime(2020-11-24 22:00),
    "B", "b@whatever.com", "schoolB1", datetime(2020-11-24 22:00),
    "B", "b@whatever.com", "schoolB2", datetime(2020-11-24 23:00),
    "C", "c@whatever.com", "schoolC",  datetime(2020-11-24 21:00),
    "D", "d@whatever.com", "schoolD",  datetime(2020-11-24 23:00),
    "E", "e@whatever.com", "schoolE",  datetime(2020-11-24 22:00),
]
;
T1
| join kind=rightouter (
    T2
    // keep only the latest row per (username, email)
    | summarize arg_max(timestamp, *) by username, email
) on $left.name == $right.username
// take the name from the right side, so unmatched rows (D, E) still get one
| project name = username, pets, color, city, email, school, timestamp
| order by name asc, pets asc
| name | pets | color | city | email | school | timestamp |
|------|------|--------|--------|----------------|----------|-----------------------------|
| A | A1 | blue | NYC | a@whatever.com | schoolA | 2020-11-24 22:00:00.0000000 |
| A | A2 | blue | NYC | a@whatever.com | schoolA | 2020-11-24 22:00:00.0000000 |
| A | A3 | blue | NYC | a@whatever.com | schoolA | 2020-11-24 22:00:00.0000000 |
| B | B1 | red | Boston | b@whatever.com | schoolB2 | 2020-11-24 23:00:00.0000000 |
| C | C1 | yellow | Miami | c@whatever.com | schoolC | 2020-11-24 21:00:00.0000000 |
| C | C2 | yellow | Miami | c@whatever.com | schoolC | 2020-11-24 21:00:00.0000000 |
| D | | | | d@whatever.com | schoolD | 2020-11-24 23:00:00.0000000 |
| E | | | | e@whatever.com | schoolE | 2020-11-24 22:00:00.0000000 |
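To see what the deduplication step does on its own, here is a minimal sketch that runs only the arg_max() part against a trimmed-down copy of T2. The three-row sample is an assumption made for illustration; the column names are the ones used in the query above.

```kusto
// Trimmed-down copy of T2 (illustrative sample, not the full table)
let T2 = datatable(username:string, email:string, school:string, timestamp:datetime)
[
    "B", "b@whatever.com", "schoolB1", datetime(2020-11-24 22:00),
    "B", "b@whatever.com", "schoolB2", datetime(2020-11-24 23:00),
    "C", "c@whatever.com", "schoolC",  datetime(2020-11-24 21:00),
]
;
T2
// one row per (username, email): the row with the greatest timestamp,
// with "*" carrying along the remaining columns of that row
| summarize arg_max(timestamp, *) by username, email
| order by username asc
```

For B, only the 23:00 row (schoolB2) should survive, and C keeps its single row. Running this step in isolation first is a quick way to check that the grouping keys match what you consider a duplicate before adding the join.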