识别 sql 中的重复原始数据(贷款)
Identify repeating raws (loans) in sql
我有一个 table 这样的:
Client Branch Amount Date
1 2 1500 1.1.14
1 2 1400 3.1.14
1 3 1500 1.1.14
1 4 300 7.1.14
1 5 1500 1.1.14
------------------------------
2 2 300 1.1.14
2 2 300 1.1.14
2 5 300 1.1.14
2 3 400 4.1.14
------------------------------
3 2 300 1.1.14
3 2 300 1.1.14
3 5 300 1.1.14
3 5 300 1.1.14
3 3 400 4.1.14
4 2 300 1.1.14
4 2 300 1.1.14
4 5 300 1.1.14
4 5 300 1.1.14
4 5 300 1.1.14
我想要的输出应该是这样的:
Client Branch Amount Date Ind Loan_Distinct_Num
1 2 1500 1.1.14 0 1
1 2 1400 3.1.14 0 2
1 3 1500 1.1.14 1 1
1 4 300 7.1.14 0 3
1 5 1500 1.1.14 1 1
-------------------------------------------------
2 2 300 1.1.14 0 1
2 2 300 1.1.14 0 2
2 5 300 1.1.14 1 2
2 3 400 4.1.14 0 3
--------------------------------------------------
3 2 300 1.1.14 0 1
3 2 300 1.1.14 0 2
3 5 300 1.1.14 1 1
3 5 300 1.1.14 1 2
3 3 400 4.1.14 0 3
------------------------------------------------
4 2 300 1.1.14 0 1
4 2 300 1.1.14 0 2
4 5 300 1.1.14 1 1
4 5 300 1.1.14 1 2
4 5 300 1.1.14 0 3
那我想做什么? (注:那些记录只是样本数据)
嗯,这些是规则:
客户已从同一家银行的一个分行转移到另一个分行。问题是分支机构多次给他写数据。我想确定重复的贷款。需要两个步骤:
第一步:
假设:Same_Amount + Same_Date + Diffrent 日期 ---> Ind = 1 在 after 的记录上先生.
工业领域如何运作?
例如:
在 client = 1 的分区中,1500 的数量在同一日期和不同的分支上重复了 3 次,但只有最后两条记录的详细信息将获得 Ind 的“1”值,第一个将获得 Ind = 0,因为它不是重复贷款,这是第一次在数据中出现具有此金额和日期的记录。
在 client = 2 的情况下,branch = 2 有两条记录,而 branch = 5 只有一条记录,所以在这种情况下,我假设 branch = 2 的最后一条记录被重复。
如果 client = 3 在 branch = 2 中有两条记录,在 branch = 5 中有两条记录,所以在这种情况下,我假设来自 branch 2 的两次贷款都被重复了。
如果客户 = 4,那么它将与客户 3 相同,但还有另一条记录,但我会认为这是一条新记录,因为我没有额外的过去贷款与她沟通。
第 2 步:我想为每个客户创建自己独特的贷款编号
关于如何解决这个又简单的问题有什么帮助吗?
评论:sql-server 2008.
首先 - 将您的数据设置为 table。我添加了一个标识列 ID,因此我们有一些东西可以排序 - 你在评论中指定你的数据是按特定顺序排列的。
declare @data table (ID int identity(1,1), Client int, Branch int, Amount int, [Date] date);
insert into @data values
(1,2, 1500,'2014-01-01'),
(1,2, 1400,'2014-03-01'),
(1,3, 1500,'2014-01-01'),
(1,4, 300,'2014-07-01'),
(1,5, 1500,'2014-01-01'),
(2,2, 300,'2014-01-01'),
(2,2, 300,'2014-01-01'),
(2,5, 300,'2014-01-01'),
(2,3, 400,'2014-04-01'),
(3,2, 300,'2014-01-01'),
(3,2, 300,'2014-01-01'),
(3,5, 300,'2014-01-01'),
(3,5, 300,'2014-01-01'),
(3,3, 400,'2014-04-01'),
(4,2, 300,'2014-01-01'),
(4,2, 300,'2014-01-01'),
(4,5, 300,'2014-01-01'),
(4,5, 300,'2014-01-01'),
(4,5, 300,'2014-01-01');
这里是我们查询的地方:
--In the first cte, we take all the data, and partition it up into individual loans (partition by Client, Amount, Date).
with cte1 as (
select *, ROW_NUMBER() over (partition by Client, Amount, Date order by ID) as rowno from @data
), cte2 as (
--in this cte, we get a list of distinct loans. We will use another rownumber in a bit to find our Loan_Distinct_Num
select distinct Client, Amount, [Date] from @data
)
select cte1.Client, cte1.Branch, cte1.Amount, cte1.[Date]
-- If rowno = 1, it's the first instance of that combination
, case when rowno = 1 then 0 else 1 end as ind
, b.Loan_Distinct_Num
from cte1
left join (select cte2.*, ROW_NUMBER() over (partition by Client order by [Date]) as Loan_Distinct_Num
-- This is where our distinct loan number comes from
from cte2
) as b
on b.Client = cte1.Client and b.Amount = cte1.Amount and b.[Date] = cte1.[Date]
order by ID
如果存在具有不同分支 # 的先前记录(请参阅第 7 行),则 ind 应该仅为 1 时,这是一个答案。此外,使用 dense_rank 在 loan_distinct_num 中按 amount/date 对贷款进行分组。该列的逻辑似乎更复杂 - 如果这是一次性修复,我可能会使用游标遍历 table 并应用一些更复杂的逻辑来填充该列,而不是尝试计算它在查询中。
-- sample data
declare @data table (ID int identity(1,1), Client int, Branch int, Amount int, [Date] date);
insert into @data values
(1,2, 1500,'2014-01-01'),
(1,2, 1400,'2014-03-01'),
(1,3, 1500,'2014-01-01'),
(1,4, 300,'2014-07-01'),
(1,5, 1500,'2014-01-01'),
(2,2, 300,'2014-01-01'),
(2,2, 300,'2014-01-01'),
(2,5, 300,'2014-01-01'),
(2,3, 400,'2014-04-01'),
(3,2, 300,'2014-01-01'),
(3,2, 300,'2014-01-01'),
(3,5, 300,'2014-01-01'),
(3,5, 300,'2014-01-01'),
(3,3, 400,'2014-04-01'),
(4,2, 300,'2014-01-01'),
(4,2, 300,'2014-01-01'),
(4,5, 300,'2014-01-01'),
(4,5, 300,'2014-01-01'),
(4,5, 300,'2014-01-01');
-- query
select client, branch, amount, date,
case when exists (select * from @data t2 where client = tbl.client and branch <> tbl.branch and amount = tbl.amount and date = tbl.date and id < tbl.id) then 1 else 0 end as ind,
DENSE_RANK() over (partition by client order by date, amount asc) as loan_disinct_num
from @data tbl
order by id;
我有一个 table 这样的:
Client Branch Amount Date
1 2 1500 1.1.14
1 2 1400 3.1.14
1 3 1500 1.1.14
1 4 300 7.1.14
1 5 1500 1.1.14
------------------------------
2 2 300 1.1.14
2 2 300 1.1.14
2 5 300 1.1.14
2 3 400 4.1.14
------------------------------
3 2 300 1.1.14
3 2 300 1.1.14
3 5 300 1.1.14
3 5 300 1.1.14
3 3 400 4.1.14
4 2 300 1.1.14
4 2 300 1.1.14
4 5 300 1.1.14
4 5 300 1.1.14
4 5 300 1.1.14
我想要的输出应该是这样的:
Client Branch Amount Date Ind Loan_Distinct_Num
1 2 1500 1.1.14 0 1
1 2 1400 3.1.14 0 2
1 3 1500 1.1.14 1 1
1 4 300 7.1.14 0 3
1 5 1500 1.1.14 1 1
-------------------------------------------------
2 2 300 1.1.14 0 1
2 2 300 1.1.14 0 2
2 5 300 1.1.14 1 2
2 3 400 4.1.14 0 3
--------------------------------------------------
3 2 300 1.1.14 0 1
3 2 300 1.1.14 0 2
3 5 300 1.1.14 1 1
3 5 300 1.1.14 1 2
3 3 400 4.1.14 0 3
------------------------------------------------
4 2 300 1.1.14 0 1
4 2 300 1.1.14 0 2
4 5 300 1.1.14 1 1
4 5 300 1.1.14 1 2
4 5 300 1.1.14 0 3
那我想做什么? (注:那些记录只是样本数据)
嗯,这些是规则: 客户已从同一家银行的一个分行转移到另一个分行。问题是分支机构多次给他写数据。我想确定重复的贷款。需要两个步骤:
第一步: 假设:Same_Amount + Same_Date + Diffrent 日期 ---> Ind = 1 在 after 的记录上先生.
工业领域如何运作?
例如: 在 client = 1 的分区中,1500 的数量在同一日期和不同的分支上重复了 3 次,但只有最后两条记录的详细信息将获得 Ind 的“1”值,第一个将获得 Ind = 0,因为它不是重复贷款,这是第一次在数据中出现具有此金额和日期的记录。
在 client = 2 的情况下,branch = 2 有两条记录,而 branch = 5 只有一条记录,所以在这种情况下,我假设 branch = 2 的最后一条记录被重复。
如果 client = 3 在 branch = 2 中有两条记录,在 branch = 5 中有两条记录,所以在这种情况下,我假设来自 branch 2 的两次贷款都被重复了。
如果客户 = 4,那么它将与客户 3 相同,但还有另一条记录,但我会认为这是一条新记录,因为我没有额外的过去贷款与她沟通。
第 2 步:我想为每个客户创建自己独特的贷款编号
关于如何解决这个又简单的问题有什么帮助吗?
评论:sql-server 2008.
首先 - 将您的数据设置为 table。我添加了一个标识列 ID,因此我们有一些东西可以排序 - 你在评论中指定你的数据是按特定顺序排列的。
declare @data table (ID int identity(1,1), Client int, Branch int, Amount int, [Date] date);
insert into @data values
(1,2, 1500,'2014-01-01'),
(1,2, 1400,'2014-03-01'),
(1,3, 1500,'2014-01-01'),
(1,4, 300,'2014-07-01'),
(1,5, 1500,'2014-01-01'),
(2,2, 300,'2014-01-01'),
(2,2, 300,'2014-01-01'),
(2,5, 300,'2014-01-01'),
(2,3, 400,'2014-04-01'),
(3,2, 300,'2014-01-01'),
(3,2, 300,'2014-01-01'),
(3,5, 300,'2014-01-01'),
(3,5, 300,'2014-01-01'),
(3,3, 400,'2014-04-01'),
(4,2, 300,'2014-01-01'),
(4,2, 300,'2014-01-01'),
(4,5, 300,'2014-01-01'),
(4,5, 300,'2014-01-01'),
(4,5, 300,'2014-01-01');
这里是我们查询的地方:
--In the first cte, we take all the data, and partition it up into individual loans (partition by Client, Amount, Date).
with cte1 as (
select *, ROW_NUMBER() over (partition by Client, Amount, Date order by ID) as rowno from @data
), cte2 as (
--in this cte, we get a list of distinct loans. We will use another rownumber in a bit to find our Loan_Distinct_Num
select distinct Client, Amount, [Date] from @data
)
select cte1.Client, cte1.Branch, cte1.Amount, cte1.[Date]
-- If rowno = 1, it's the first instance of that combination
, case when rowno = 1 then 0 else 1 end as ind
, b.Loan_Distinct_Num
from cte1
left join (select cte2.*, ROW_NUMBER() over (partition by Client order by [Date]) as Loan_Distinct_Num
-- This is where our distinct loan number comes from
from cte2
) as b
on b.Client = cte1.Client and b.Amount = cte1.Amount and b.[Date] = cte1.[Date]
order by ID
如果存在具有不同分支 # 的先前记录(请参阅第 7 行),则 ind 应该仅为 1 时,这是一个答案。此外,使用 dense_rank 在 loan_distinct_num 中按 amount/date 对贷款进行分组。该列的逻辑似乎更复杂 - 如果这是一次性修复,我可能会使用游标遍历 table 并应用一些更复杂的逻辑来填充该列,而不是尝试计算它在查询中。
-- sample data
declare @data table (ID int identity(1,1), Client int, Branch int, Amount int, [Date] date);
insert into @data values
(1,2, 1500,'2014-01-01'),
(1,2, 1400,'2014-03-01'),
(1,3, 1500,'2014-01-01'),
(1,4, 300,'2014-07-01'),
(1,5, 1500,'2014-01-01'),
(2,2, 300,'2014-01-01'),
(2,2, 300,'2014-01-01'),
(2,5, 300,'2014-01-01'),
(2,3, 400,'2014-04-01'),
(3,2, 300,'2014-01-01'),
(3,2, 300,'2014-01-01'),
(3,5, 300,'2014-01-01'),
(3,5, 300,'2014-01-01'),
(3,3, 400,'2014-04-01'),
(4,2, 300,'2014-01-01'),
(4,2, 300,'2014-01-01'),
(4,5, 300,'2014-01-01'),
(4,5, 300,'2014-01-01'),
(4,5, 300,'2014-01-01');
-- query
select client, branch, amount, date,
case when exists (select * from @data t2 where client = tbl.client and branch <> tbl.branch and amount = tbl.amount and date = tbl.date and id < tbl.id) then 1 else 0 end as ind,
DENSE_RANK() over (partition by client order by date, amount asc) as loan_disinct_num
from @data tbl
order by id;