Get top x% rows for every unique value in column by other column value

Question

Table "tags":

Source  Target      Weight
#003    blitzkrank  0.83
#003    deutsch     0.7
#003    brammen     0.57
#003    butzfrauen  0.55
#003    solaaaa     0.5
#003    moments     0.3
college scandal     1.15
college prosecutors 0.82
college students    0.41
college usc         0.33
college full house  0.17
college friends     0.08
college house       0.5
college friend      0.01

The table has 5,600,000 rows and about 91,000 unique entries in the "Source" column.

For every unique value in "Source" and in "Target", I need the top x% of the rows by "Weight" (table sorted by "Source" (ascending) and "Weight" (descending)).

If rows have the same "Weight", take them in alphabetical order.
If x% of a group's rows comes out to 0, take at least one row.

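Since there will be duplicates (e.g. "Source" = "college" will yield at least one row that also shows up for "Target" = "scandal"), duplicate entries should be removed if possible. If not, it is no big deal.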

Calculation for "Source":

6 rows where Source = "#003", 6 * 0.2 = 1.2 = take 1 row
8 rows where Source = "college", 8 * 0.2 = 1.6 = take 2 rows

Desired result table for "Source":

Source  Target      Weight
#003    blitzkrank  0.83
college scandal     1.15
college prosecutors 0.82

How can I do this with SQL in a SQLite database?

Answer 1

If you want the sample by source:

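select t.*
from (select t.*,
             -- rank rows within each source: highest weight first, ties broken alphabetically by target
             row_number() over (partition by source order by weight desc, target) as seqnum,
             -- number of rows for this source
             count(*) over (partition by source) as cnt
      from tags t
     ) t
where seqnum = 1 or  -- always at least one row
      seqnum <= round(cnt * 0.2);  -- top 20%, rounded to the nearest row count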

Based on your example, I think this is what you want. You can construct a similar query for target.
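As a rough sketch of how the two pieces might fit together, the following runs the query above against the sample rows from the question and then combines it with the mirrored "Target" query through UNION, which should drop rows picked by both halves. The helper names (`make_db`, `top_fraction_sql`, `top_by`, `top_by_either`), the lowercase column names, and the choice of "Source" as the tie-breaker for the target query are assumptions made for illustration, not part of the original answer. It also assumes SQLite 3.25 or newer, since window functions are required.

```python
import sqlite3

# Sample rows copied from the question's "tags" table.
SAMPLE = [
    ("#003", "blitzkrank", 0.83), ("#003", "deutsch", 0.7),
    ("#003", "brammen", 0.57), ("#003", "butzfrauen", 0.55),
    ("#003", "solaaaa", 0.5), ("#003", "moments", 0.3),
    ("college", "scandal", 1.15), ("college", "prosecutors", 0.82),
    ("college", "students", 0.41), ("college", "usc", 0.33),
    ("college", "full house", 0.17), ("college", "friends", 0.08),
    ("college", "house", 0.5), ("college", "friend", 0.01),
]

ORDERING = " order by source, weight desc, target"


def make_db(rows):
    """Create an in-memory database holding the given (source, target, weight) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table tags (source text, target text, weight real)")
    conn.executemany("insert into tags values (?, ?, ?)", rows)
    return conn


def top_fraction_sql(group_col, tiebreak_col):
    """Build the per-group query; names are validated because they are interpolated."""
    if {group_col, tiebreak_col} != {"source", "target"}:
        raise ValueError("columns must be 'source' and 'target'")
    return f"""
        select source, target, weight
        from (select t.*,
                     row_number() over (partition by {group_col}
                                        order by weight desc, {tiebreak_col}) as seqnum,
                     count(*) over (partition by {group_col}) as cnt
              from tags t) ranked
        where seqnum = 1 or seqnum <= round(cnt * :frac)
    """


def top_by(conn, group_col, tiebreak_col, frac=0.2):
    """Top `frac` of rows per value of group_col, at least one row per value."""
    sql = top_fraction_sql(group_col, tiebreak_col) + ORDERING
    return conn.execute(sql, {"frac": frac}).fetchall()


def top_by_either(conn, frac=0.2):
    """Rows in the top `frac` for their source or for their target, deduplicated."""
    sql = (top_fraction_sql("source", "target")
           + " union "  # union (not union all) removes rows selected by both halves
           + top_fraction_sql("target", "source")
           + ORDERING)
    return conn.execute(sql, {"frac": frac}).fetchall()


if __name__ == "__main__":
    conn = make_db(SAMPLE)
    for row in top_by(conn, "source", "target"):
        print(row)
```

Note that in the sample data every "Target" value occurs only once, so the "at least one row" rule makes the target half return every row; the combined query only becomes selective when targets repeat across sources, as they presumably do in the full table.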

sql

sqlite

percentage

greatest-n-per-group