TSQL - Run date comparison for "duplicates"/false positives on initial query?

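I'm new to SQL and am trying to pull some data from a few very large tables for analysis. The data is basically triggered events for assets on a system. The events all have a created_date (datetime) field that I care about.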

I was able to put together the query below to get the data I need (YAY):

SELECT 
         event.efkey
        ,event.e_id
        ,event.e_key
        ,l.l_name
        ,event.created_date
        ,asset.a_id
        ,asset.asset_name
  FROM event
  LEFT JOIN asset
         ON event.a_key = asset.a_key
  LEFT JOIN l
         ON event.l_key = l.l_key


  WHERE event.e_key IN (350, 352, 378)

  ORDER BY asset.a_id, event.created_date

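However, while this gives me the data for the specific events I want, I have another problem. Assets can trigger these events repeatedly, which can produce a large number of "false positives" in what I'm looking at.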

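What I need to do is go through the result set of the query above and remove any events for an asset that occur less than N minutes apart (30 minutes for this example). So if the asset_ID is the same and event.created_date is within 30 minutes of another event for that asset in the set, I want it removed. For example: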

Given the following records:

a_id 1124 created 2016-02-01 12:30:30
a_id 1124 created 2016-02-01 12:35:31
a_id 1124 created 2016-02-01 12:40:33
a_id 1124 created 2016-02-01 12:45:42
a_id 1124 created 2016-02-02 12:30:30
a_id 1124 created 2016-02-02 13:00:30
a_id 1115 created 2016-02-01 12:30:30

I would only want to return:

a_id 1124 created 2016-02-01 12:30:30 
a_id 1124 created 2016-02-02 12:30:30 
a_id 1124 created 2016-02-02 13:00:30 
a_id 1115 created 2016-02-01 12:30:30

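I've tried referencing a couple of other posts, but I can't make the concepts there work for me. I know I probably need to do a SELECT * FROM (my existing query), but I can't seem to do that without ending up with tons of "multi-part identifier can't be bound" errors (and I have no experience creating temp tables; my attempts so far have failed). I'm also not quite sure how to use DATEDIFF as a date filtering function.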

Any help would be greatly appreciated! If you could simplify it for a novice (or link to an explanation), that would be helpful too!

-- Sample data.
declare @Samples as Table ( Id Int Identity, A_Id Int, CreatedDate DateTime );
insert into @Samples ( A_Id, CreatedDate ) values
  ( 1124, '2016-02-01 12:30:30' ),
  ( 1124, '2016-02-01 12:35:31' ),
  ( 1124, '2016-02-01 12:40:33' ),
  ( 1124, '2016-02-01 12:45:42' ),
  ( 1124, '2016-02-02 12:30:30' ),
  ( 1124, '2016-02-02 13:00:30' ),
  ( 1115, '2016-02-01 12:30:30' );
select * from @Samples;

-- Calculate the windows of 30 minutes before and after each CreatedDate and check for conflicts with other rows.
with Ranges as (
  select Id, A_Id, CreatedDate,
    DateAdd( minute, -30, S.CreatedDate ) as RangeStart, DateAdd( minute, 30, S.CreatedDate ) as RangeEnd
    from @Samples as S )
  select Id, A_Id, CreatedDate, RangeStart, RangeEnd,
    -- Check for a conflict with another row that has:
    --   the same A_Id value and an earlier CreatedDate that falls inside the +/-30 minute range.
    case when exists ( select 42 from @Samples
        where A_Id = R.A_Id and CreatedDate < R.CreatedDate
          and R.RangeStart < CreatedDate and CreatedDate < R.RangeEnd ) then 1
      else 0 end as Conflict
    from Ranges as R;
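
The query above only flags the conflicts. A minimal sketch of keeping just the rows without one, assuming the @Samples table variable from the sample data is declared in the same batch (the aliases R and S are only for illustration):

```sql
-- Sketch: keep a row only when no earlier row for the same A_Id
-- falls inside the 30 minutes before it.
-- Assumes @Samples is declared earlier in the same batch.
select R.Id, R.A_Id, R.CreatedDate
  from @Samples as R
  where not exists (
    select 42
      from @Samples as S
      where S.A_Id = R.A_Id
        and S.CreatedDate < R.CreatedDate
        and S.CreatedDate > DateAdd( minute, -30, R.CreatedDate ) )
  order by R.A_Id, R.CreatedDate;
```

On the sample data this should return the first 1124 row of 2016-02-01, both 2016-02-02 rows, and the single row for the other asset. Note that each row is compared with any earlier row, not with the last row that was kept, so a long chain of events that are each less than 30 minutes apart would be reduced to its first event only.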

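This is a trickier problem than it initially appears. The hard part is capturing the previous good row and removing the bad rows that follow it, without letting those bad rows influence whether the next row is good. Here is what I came up with. I've tried to explain what is going on in the comments in the code.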

--sample data since I don't have your table structure and your original query won't work for me
declare @events table
(
  id int,
  timestamp datetime
)

--note that I changed some of your sample data to test some different scenarios
insert into @events values( 1124, '2016-02-01 12:30:30')
insert into @events values( 1124, '2016-02-01 12:35:31')
insert into @events values( 1124, '2016-02-01 12:40:33')
insert into @events values( 1124, '2016-02-01 13:05:42')
insert into @events values( 1124, '2016-02-02 12:30:30')
insert into @events values( 1124, '2016-02-02 13:00:30')
insert into @events values( 1115, '2016-02-01 12:30:30')

--using a cte here to split the result set of your query into groups
--by id (you would want to partition by whatever criteria you use
--to determine that rows are talking about the same event)
--the row_number function gets the row number for each row within that 
--id partition
--the over clause specifies how to break up the result set into groups 
--(partitions) and what order to put the rows in within that group so 
--that the numbering stays consistent
;with orderedEvents as
(
    select id, timestamp, row_number() over (partition by id order by timestamp) as rn
    from @events
    --you would replace @events here with your query
)
--using a second recursive cte here to determine which rows are "good"
--and which ones are not.  
, previousGoodTimestamps as 
(
    --this is the "seeding" part of the recursive cte where I pick the
    --first rows of each group as being a desired result.  Since they 
    --are the first in each group, I know they are good.  I also assign
    --their timestamp as the previous good timestamp since I know that 
    --this row is good.
    select id, timestamp, rn, timestamp as prev_good_timestamp, 1 as is_good
    from orderedEvents
    where rn = 1

    union all

    --this is the recursive part of the cte.  It takes the rows we have
    --already added to this result set and joins those to the "next" rows
    --(as defined by our ordering in the first cte).  Then we output
    --those rows and do some calculations to determine if this row is 
    --"good" or not.  If it is "good" we set its timestamp as the
    --previous good row timestamp so that rows that come after this one 
    --can use it to determine if they are good or not.  If a row is "bad"
    --we just forward along the last known good timestamp to the next row.
    --
    --We also determine if a row is good by checking if the last good row
    --timestamp plus 30 minutes is less than or equal to the current row's
    --timestamp.  If it is then the row is good.
    select e2.id
        , e2.timestamp
        , e2.rn
        , last_good_timestamp.timestamp
        , case
            when dateadd(mi, 30, last_good_timestamp.timestamp) <= e2.timestamp then 1
            else 0
          end
    from previousGoodTimestamps e1
    inner join orderedEvents e2 on e2.id = e1.id and e2.rn = e1.rn + 1
    --I used a cross apply here to calculate the last good row timestamp
    --once.  I could have used two identical subqueries above in the select
    --and case statements, but I would rather not duplicate the code.
    cross apply
    (
        select case 
                 when e1.is_good = 1 then e1.timestamp --if the last row is good, just use its timestamp
                 else e1.prev_good_timestamp --the last row was bad, forward on what it had for the last good timestamp
               end as timestamp
    ) last_good_timestamp
)
select *
from previousGoodTimestamps
where is_good = 1 --only take the "good" rows
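
To show what "replace @events with your query" could look like, here is an untested sketch that plugs the query from the question into the first CTE. The table and column names come from the question; the aliases g, o and last_good are made up, and grouping by event.a_key assumes that a_key identifies exactly one asset.

```sql
-- Sketch only: not run against the real tables.
;with orderedEvents as
(
    select event.efkey, event.e_id, event.e_key, l.l_name, event.created_date
        , event.a_key, asset.a_id, asset.asset_name
        , row_number() over (partition by event.a_key order by event.created_date) as rn
    from event
    left join asset on event.a_key = asset.a_key
    left join l on event.l_key = l.l_key
    where event.e_key in (350, 352, 378)
    --no ORDER BY here: it is not allowed inside a CTE without TOP or OFFSET
)
, previousGoodTimestamps as
(
    --seed: the first event of each asset is always kept
    select a_key, created_date, rn, created_date as prev_good_timestamp, 1 as is_good
    from orderedEvents
    where rn = 1

    union all

    --step: compare the next event with the last kept event of the same asset
    --(rows with a NULL a_key never match here, so only their first event survives)
    select e2.a_key, e2.created_date, e2.rn
        , last_good.ts
        , case when dateadd(minute, 30, last_good.ts) <= e2.created_date then 1 else 0 end
    from previousGoodTimestamps e1
    inner join orderedEvents e2 on e2.a_key = e1.a_key and e2.rn = e1.rn + 1
    cross apply
    (
        select case when e1.is_good = 1 then e1.created_date
                    else e1.prev_good_timestamp end as ts
    ) last_good
)
select o.efkey, o.e_id, o.e_key, o.l_name, o.created_date, o.a_id, o.asset_name
from previousGoodTimestamps g
inner join orderedEvents o on o.a_key = g.a_key and o.rn = g.rn
where g.is_good = 1
order by o.a_id, o.created_date
option (maxrecursion 0); --lift the default limit of 100 recursion levels
```

The recursion goes one level deeper for every event of an asset, so without the maxrecursion hint an asset with more than about 100 matching events would make the query fail. On very large tables it may also be worth loading the orderedEvents rows into an indexed temp table first, since the CTE is evaluated again on every join.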

Here are MSDN links for some of the more complex stuff: