T-SQL: run a date comparison to filter "duplicates"/false positives out of an initial query?
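I'm new to SQL and am trying to pull some data out of several very large tables for analysis. The data is essentially triggered events for assets on a system. Each event has a created_date (datetime) field that I care about.
I was able to put together the query below to get the data I need (YAY):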
SELECT
event.efkey
,event.e_id
,event.e_key
,l.l_name
,event.created_date
,asset.a_id
,asset.asset_name
FROM event
LEFT JOIN asset
ON event.a_key = asset.a_key
LEFT JOIN l
ON event.l_key = l.l_key
WHERE event.e_key IN (350, 352, 378)
ORDER BY asset.a_id, event.created_date
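However, while this gives me the data for the specific events I want, I still have another problem. Assets can trigger these events repeatedly, which can produce a large number of "false positives" in what I'm looking at.
What I need to do is go through the result set of the query above and remove any event for an asset that occurs within N minutes of another one (30 minutes for this example). So if the asset_ID is the same and event.created_date is within 30 minutes of another event for that asset in the set, I want it removed. For example:
Given the following records: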
a_id 1124 created 2016-02-01 12:30:30
a_id 1124 created 2016-02-01 12:35:31
a_id 1124 created 2016-02-01 12:40:33
a_id 1124 created 2016-02-01 12:45:42
a_id 1124 created 2016-02-02 12:30:30
a_id 1124 created 2016-02-02 13:00:30
a_id 1115 created 2016-02-01 12:30:30
I only want to return:
a_id 1124 created 2016-02-01 12:30:30
a_id 1124 created 2016-02-02 12:30:30
a_id 1124 created 2016-02-02 13:00:30
a_id 1115 created 2016-02-01 12:30:30
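I tried to follow a couple of related answers, but I couldn't get the concepts there to work for me. I know I probably need to do a SELECT * FROM (my existing query), but I can't seem to do that without ending up with a pile of "multi-part identifier can't be bound" errors (and I have no experience creating temp tables; my attempts so far have all failed). I'm also not quite sure how to use DATEDIFF as a date filter function.
Any help would be greatly appreciated! If you can simplify it for a newbie (or link to an explanation), that would be helpful too!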
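On the DATEDIFF part of the question: DATEDIFF counts how many datepart boundaries are crossed between two values, not the elapsed time, so DATEDIFF(minute, ...) can report 1 for two events that are only a second apart. The small sketch below uses made-up literal values (@a and @b are illustrative names, not from the question) to show the difference. Comparing against DATEADD, or using DATEDIFF in seconds, tends to avoid the surprise.

```sql
-- Two hypothetical timestamps one second apart, on either side of a minute boundary.
declare @a datetime = '2016-02-01 12:30:59';
declare @b datetime = '2016-02-01 12:31:00';

select datediff(minute, @a, @b) as minute_boundaries,  -- 1: one minute boundary was crossed
       datediff(second, @a, @b) as elapsed_seconds,    -- 1: the values are one second apart
       case when @b < dateadd(minute, 30, @a)
            then 1 else 0 end as within_30_minutes;    -- 1: @b falls inside the 30 minute window
```

For a "less than 30 minutes apart" filter, the DATEADD comparison in the last column is probably the safer form, since it works on the actual values rather than on boundary counts.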
-- Sample data.
declare @Samples as Table ( Id Int Identity, A_Id Int, CreatedDate DateTime );
insert into @Samples ( A_Id, CreatedDate ) values
( 1124, '2016-02-01 12:30:30' ),
( 1124, '2016-02-01 12:35:31' ),
( 1124, '2016-02-01 12:40:33' ),
( 1124, '2016-02-01 12:45:42' ),
( 1124, '2016-02-02 12:30:30' ),
( 1124, '2016-02-02 13:00:30' ),
( 1125, '2016-02-01 12:30:30' );
select * from @Samples;
-- Calculate the windows of 30 minutes before and after each CreatedDate and check for conflicts with other rows.
with Ranges as (
  select Id, A_Id, CreatedDate,
    DateAdd( minute, -30, S.CreatedDate ) as RangeStart,
    DateAdd( minute, 30, S.CreatedDate ) as RangeEnd
  from @Samples as S )
select Id, A_Id, CreatedDate, RangeStart, RangeEnd,
  -- Check for a conflict with another row with:
  --   the same A_Id value and an earlier CreatedDate that falls inside the +/-30 minute range.
  case when exists (
    select 42 from @Samples
    where A_Id = R.A_Id and CreatedDate < R.CreatedDate
      and R.RangeStart < CreatedDate and CreatedDate < R.RangeEnd ) then 1
  else 0 end as Conflict
from Ranges as R;
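To drop the flagged rows rather than just label them, the same test can move into a WHERE clause. The following is a minimal sketch, not a tuned solution: it redeclares the sample table (with the question's 1115 asset) so that it can run as its own batch, and it keeps a row only when no earlier row for the same A_Id falls inside the preceding 30 minutes.

```sql
-- Self-contained sketch: run as its own batch.
declare @Samples as Table ( Id Int Identity, A_Id Int, CreatedDate DateTime );
insert into @Samples ( A_Id, CreatedDate ) values
  ( 1124, '2016-02-01 12:30:30' ),
  ( 1124, '2016-02-01 12:35:31' ),
  ( 1124, '2016-02-01 12:40:33' ),
  ( 1124, '2016-02-01 12:45:42' ),
  ( 1124, '2016-02-02 12:30:30' ),
  ( 1124, '2016-02-02 13:00:30' ),
  ( 1115, '2016-02-01 12:30:30' );

-- Keep a row only if no earlier row for the same A_Id lies
-- strictly inside the 30 minutes before it.
select S.Id, S.A_Id, S.CreatedDate
from @Samples as S
where not exists (
    select 1
    from @Samples as P
    where P.A_Id = S.A_Id
      and P.CreatedDate < S.CreatedDate
      and P.CreatedDate > DateAdd( minute, -30, S.CreatedDate ) )
order by S.A_Id, S.CreatedDate;
```

For these rows it returns the four events the question asks for. Note that each row is compared with any earlier row, including rows that are themselves dropped: with events at 12:00, 12:20 and 12:40, the 12:40 event is removed because of 12:20, even though it is 40 minutes after the only kept event. If that chaining is not wanted, the comparison has to be made against the last kept row instead.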
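This is a trickier problem than it first appears. The hard part is capturing the previous good row and removing the bad rows that follow it, without letting those bad rows influence whether the next row is good. Here is what I came up with. I have tried to explain what is going on with comments in the code.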
--sample data since I don't have your table structure and your original query won't work for me
declare @events table
(
id int,
timestamp datetime
)
--note that I changed some of your sample data to test some different scenarios
insert into @events values( 1124, '2016-02-01 12:30:30')
insert into @events values( 1124, '2016-02-01 12:35:31')
insert into @events values( 1124, '2016-02-01 12:40:33')
insert into @events values( 1124, '2016-02-01 13:05:42')
insert into @events values( 1124, '2016-02-02 12:30:30')
insert into @events values( 1124, '2016-02-02 13:00:30')
insert into @events values( 1115, '2016-02-01 12:30:30')
--using a cte here to split the result set of your query into groups
--by id (you would want to partition by whatever criteria you use
--to determine that rows are talking about the same event)
--the row_number function gets the row number for each row within that
--id partition
--the over clause specifies how to break up the result set into groups
--(partitions) and what order to put the rows in within that group so
--that the numbering stays consistent
;with orderedEvents as
(
select id, timestamp, row_number() over (partition by id order by timestamp) as rn
from @events
--you would replace @events here with your query
)
--using a second recursive cte here to determine which rows are "good"
--and which ones are not.
, previousGoodTimestamps as
(
--this is the "seeding" part of the recursive cte where I pick the
--first rows of each group as being a desired result. Since they
--are the first in each group, I know they are good. I also assign
--their timestamp as the previous good timestamp since I know that
--this row is good.
select id, timestamp, rn, timestamp as prev_good_timestamp, 1 as is_good
from orderedEvents
where rn = 1
union all
--this is the recursive part of the cte. It takes the rows we have
--already added to this result set and joins those to the "next" rows
--(as defined by our ordering in the first cte). Then we output
--those rows and do some calculations to determine if this row is
--"good" or not. If it is "good" we set its timestamp as the
--previous good row timestamp so that rows that come after this one
--can use it to determine if they are good or not. If a row is "bad"
--we just forward along the last known good timestamp to the next row.
--
--We also determine if a row is good by checking if the last good row
--timestamp plus 30 minutes is less than or equal to the current row's
--timestamp. If it is then the row is good.
select e2.id
, e2.timestamp
, e2.rn
, last_good_timestamp.timestamp
, case
when dateadd(mi, 30, last_good_timestamp.timestamp) <= e2.timestamp then 1
else 0
end
from previousGoodTimestamps e1
inner join orderedEvents e2 on e2.id = e1.id and e2.rn = e1.rn + 1
--I used a cross apply here to calculate the last good row timestamp
--once. I could have used two identical subqueries above in the select
--and case statements, but I would rather not duplicate the code.
cross apply
(
select case
when e1.is_good = 1 then e1.timestamp --if the last row is good, just use its timestamp
else e1.prev_good_timestamp --the last row was bad, forward on what it had for the last good timestamp
end as timestamp
) last_good_timestamp
)
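select *
from previousGoodTimestamps
where is_good = 1 --only take the "good" rows
option (maxrecursion 0) --lift the default limit of 100 recursion levels, which is hit once an id has more than about 100 rows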
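To apply the recursive CTE above to the query from the question, one option is to put that query inside the first CTE and refer to its columns only by their plain names afterwards. References such as event.created_date are not visible outside the CTE, which is a common cause of the "multi-part identifier could not be bound" error. The sketch below is untested against the real schema and makes several assumptions: the table and column names are those from the question, efkey is unique (it is used as a tie-breaker so the numbering is deterministic), and the CTE name flagged and the column last_kept are illustrative.

```sql
;with orderedEvents as
(
    -- The question's query without its ORDER BY, which is not allowed
    -- inside a CTE unless TOP, OFFSET or FOR XML is also used.
    select e.efkey, e.e_id, e.e_key, l.l_name, e.created_date,
           a.a_id, a.asset_name,
           row_number() over (partition by a.a_id
                              order by e.created_date, e.efkey) as rn
    from event as e
    left join asset as a on e.a_key = a.a_key
    left join l on e.l_key = l.l_key
    where e.e_key in (350, 352, 378)
)
, flagged as
(
    -- Anchor: the first event of each asset is always kept.
    select efkey, e_id, e_key, l_name, created_date, a_id, asset_name, rn,
           created_date as last_kept, 1 as is_good
    from orderedEvents
    where rn = 1
    union all
    -- Recursive step: compare the next event with the last kept one.
    -- Events with no matching asset (a_id is NULL) never join here,
    -- so only the first of them survives.
    select nxt.efkey, nxt.e_id, nxt.e_key, nxt.l_name, nxt.created_date,
           nxt.a_id, nxt.asset_name, nxt.rn,
           case when dateadd(minute, 30, prev.last_kept) <= nxt.created_date
                then nxt.created_date else prev.last_kept end,
           case when dateadd(minute, 30, prev.last_kept) <= nxt.created_date
                then 1 else 0 end
    from flagged as prev
    inner join orderedEvents as nxt
        on nxt.a_id = prev.a_id and nxt.rn = prev.rn + 1
)
select efkey, e_id, e_key, l_name, created_date, a_id, asset_name
from flagged
where is_good = 1
order by a_id, created_date
option (maxrecursion 0); -- the default limit is 100 recursion levels
```

Outside orderedEvents the columns are referenced as created_date, a_id and so on, never as event.created_date. On very large tables this row-by-row recursion may be slow, so it is probably worth testing on a narrow date range first.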
See the MSDN documentation for some of the more complicated pieces used here.