How to find a similarly distributed control group for a test group in SQL?

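This task is quite challenging. Each month I have about 50k+ users in the test group, and I want to match a control group of the same size and with a similar distribution from the whole user pool (about 50 million users). To get a similar distribution I have some categorical and numerical features. The categorical features will simply be inner joined. The numerical features I want to round, and that is the biggest problem.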

Here is my code:

with pl_subs as( -- in the cte I 
    select   al.*
  ,ROW_NUMBER() OVER(PARTITION BY 
      al.device_type
     ,al.report_mnth
     ,round(al.days_to_LAST_FLASH_DTTM, -1)
     ,round(al.LT_month, -1)
     ,round(al.REVC, -1)
     ,round(al.usg_in, -2)
     ,round(al.usg_AC, -1)
     ORDER BY null) AS RN
    from ai_pl_SUBS test_gr
    inner join  ai_SUBS_MONTH_CLR al 
    on al.cust_id = test_gr.cust_id
    and al.report_mnth = test_gr.REGISTERED_mnth
    where al.report_mnth  = '2017-11' and test_gr.REGISTERED_mnth = '2017-11'
)
sel count(1) -- just to count
from (
sel al.cust_id, pl_subs.rn rn_pl
 ,ROW_NUMBER() OVER(PARTITION BY 
  pl_subs.device_type
 ,pl_subs.report_mnth
 ,pl_subs.MCID
 ,round(pl_subs.days_to_LF, -1)
 ,round(pl_subs.LT_month, -1)
 ,round(pl_subs.REVC, -1)
 ,round(pl_subs.usg_in, -2)
 ,round(pl_subs.usg_AC, -1)
 ORDER BY null) AS RN
from pl_subs
inner join ai_SUBS_MONTH_CLR al on 

-- 2 categorical features
pl_subs.device_type =  al.device_type
and pl_subs.report_mnth = al.report_mnth

-- 5 numerical features
and round(pl_subs.days_to_LF, -1) = Round(al.days_to_LF, -1)
and round(pl_subs.LT_month, -1) = Round(al.LT_month, -1)
and round(pl_subs.REVC, -1) = Round(al.REVC, -1)
and round(pl_subs.usg_in, -2) = Round(al.usg_in, -2) 
and round(pl_subs.usg_AC, -1) = Round(al.usg_AC, -1) 
-- the control group should not contain any cust_id from the test group
where al.cust_id not in (select cust_id from ai_pl_SUBS)
    and al.report_mnth = '2017-11'
    ) _out where rn <=  rn_pl 
-- the 7 features together determine a stratum, so I need as many customers as there are in the corresponding stratum of the test group

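Customers in the test group have higher values for the numerical features. In the code above I round to tens so that the intermediate spool does not get too large, but the result is only 36k users instead of the expected 50k. If I round more finely, the query fails with a spool space problem.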

Similar distribution: the means of the numerical features are equal.

Is there an error in my code? How can I modify it so that a customer can be included in a stratum multiple times?

My code above has several problems:

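1) round(pl_subs.LT_month, -1) = round(al.LT_month, -1) -- using round on widely distributed values eventually causes problems when looking for control customers that match the test customers. So just use CASE: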

case when LT_month <= 4 then '0'
     when LT_month <= 8 then '1'
     when LT_month <= 12 then '2'
     when LT_month <= 17 then '3'
     when LT_month <= 24 then '4'
     when LT_month <= 36 then '5'
     when LT_month <= 56 then '6'
     when LT_month <= 83 then '7'
     when LT_month <= 96 then '8'
     else '9' -- assumed catch-all bucket for LT_month > 96
end as segment
Precomputing these buckets and indexing them makes the query run very fast. But don't overdo it.
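A minimal sketch of the precompute-and-index idea, run against an in-memory SQLite database from Python so it can be tried without a Teradata system. The table name `subs_month`, the index name, the sample rows, and the `ELSE '9'` catch-all bucket are assumptions made for illustration; on Teradata the same idea would be a stored column plus a suitable index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in for ai_SUBS_MONTH_CLR with one numerical feature.
    CREATE TABLE subs_month (
        cust_id     INTEGER,
        report_mnth TEXT,
        lt_month    INTEGER,
        segment     TEXT
    );
    INSERT INTO subs_month (cust_id, report_mnth, lt_month) VALUES
        (1, '2017-11', 3), (2, '2017-11', 10),
        (3, '2017-11', 30), (4, '2017-11', 120);

    -- Materialise the bucket once instead of evaluating CASE inside every join.
    UPDATE subs_month SET segment = CASE
        WHEN lt_month <= 4  THEN '0'
        WHEN lt_month <= 8  THEN '1'
        WHEN lt_month <= 12 THEN '2'
        WHEN lt_month <= 17 THEN '3'
        WHEN lt_month <= 24 THEN '4'
        WHEN lt_month <= 36 THEN '5'
        WHEN lt_month <= 56 THEN '6'
        WHEN lt_month <= 83 THEN '7'
        WHEN lt_month <= 96 THEN '8'
        ELSE '9'  -- assumed catch-all bucket, not part of the original list
    END;

    -- Index the columns that the strata join compares.
    CREATE INDEX ix_subs_month_strata ON subs_month (report_mnth, segment);
""")

# Map each customer to its precomputed bucket.
segments = dict(conn.execute("SELECT cust_id, segment FROM subs_month"))
print(segments)
```

With the bucket stored as a plain column, the matching join compares `segment = segment` directly instead of recomputing an expression on both sides for every candidate row.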

2) The CTE should contain only the strata plus the number of customers each stratum should contain:

with pl_subs as ( -- one row per stratum of the test group, with its size
    select
          al.device_type
         ,al.report_mnth
         ,al.segment          -- precomputed bucket column; replaces the round(...) columns
         ,count(1) as max_rn  -- number of control customers needed in this stratum
    from ai_pl_SUBS test_gr
    inner join ai_SUBS_MONTH_CLR al
        on al.cust_id = test_gr.cust_id
       and al.report_mnth = test_gr.REGISTERED_mnth
    where al.report_mnth = '2017-11' and test_gr.REGISTERED_mnth = '2017-11'
    group by 1, 2, 3
)

sel subs_id, report_mnth from (
sel al.subs_id, al.report_mnth, pl_subs.max_rn max_rn
 ,ROW_NUMBER() OVER(PARTITION BY 
  pl_subs.device_type
 ,pl_subs.report_mnth
 ,pl_subs.segment
 ORDER BY null) AS RN
from pl_subs
inner join UAT_DM.ai_SUBS_MONTH_CLR al on 
pl_subs.device_type =  al.device_type
and pl_subs.report_mnth = al.report_mnth
and pl_subs.segment = al.segment
where al.subs_id not in (select subs_id from UAT_DM.ai_pl_SUBS)
    and al.report_mnth = '2017-11'

) _out where rn <=  max_rn;
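The pattern above (count the test customers per stratum, then keep at most that many candidates per stratum) can be sketched on toy data. This is an illustration under stated assumptions, not the production query: it runs on SQLite (3.25 or later, for window functions) through Python, the tables `subs_month` and `test_group` and all rows are made up, and it orders by `cust_id` inside each stratum only to make the result reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subs_month (cust_id INTEGER, device_type TEXT, segment TEXT);
    CREATE TABLE test_group (cust_id INTEGER);
""")
rows = [
    # test customers
    (1, "phone", "0"), (2, "phone", "0"), (3, "tablet", "1"), (4, "tablet", "3"),
    # control candidates
    (10, "phone", "0"), (11, "phone", "0"), (12, "phone", "0"),  # 3 available, 2 needed
    (13, "tablet", "1"),                                         # 1 available, 1 needed
    (14, "tablet", "2"),                                         # stratum absent from the test group
]
conn.executemany("INSERT INTO subs_month VALUES (?, ?, ?)", rows)
conn.executemany("INSERT INTO test_group VALUES (?)", [(1,), (2,), (3,), (4,)])

matched = conn.execute("""
    WITH strata AS (  -- one row per test stratum with the number of customers needed
        SELECT s.device_type, s.segment, COUNT(*) AS max_rn
        FROM test_group t
        JOIN subs_month s ON s.cust_id = t.cust_id
        GROUP BY s.device_type, s.segment
    )
    SELECT cust_id, device_type, segment
    FROM (
        SELECT c.cust_id, c.device_type, c.segment, strata.max_rn,
               ROW_NUMBER() OVER (
                   PARTITION BY c.device_type, c.segment
                   ORDER BY c.cust_id
               ) AS rn
        FROM strata
        JOIN subs_month c
          ON c.device_type = strata.device_type
         AND c.segment = strata.segment
        WHERE c.cust_id NOT IN (SELECT cust_id FROM test_group)
    ) AS ranked
    WHERE rn <= max_rn  -- keep at most as many controls as there are test customers
    ORDER BY cust_id
""").fetchall()

print(matched)
```

Because the CTE has one row per stratum, each candidate appears once in the join, so the row numbers count distinct candidates. The stratum `tablet`/`3` has a test customer but no candidate, so the control group here comes out one short of the test group; any stratum with fewer candidates than test customers produces the same kind of shortfall.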