Get top x% rows for every unique value in column by other column value
Table "tags":
Source   Target       Weight
#003     blitzkrank   0.83
#003     deutsch      0.7
#003     brammen      0.57
#003     butzfrauen   0.55
#003     solaaaa      0.5
#003     moments      0.3
college  scandal      1.15
college  prosecutors  0.82
college  students     0.41
college  usc          0.33
college  full house   0.17
college  friends      0.08
college  house        0.5
college  friend       0.01
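The table has 5,600,000 rows and about 91,000 unique entries in the "Source" column.
For every unique value in "Source" and in "Target", I need the top x% of rows by weight (with the table sorted by "Source" ascending and "Weight" descending).
- If rows have the same "Weight", take them in alphabetical order.
- If x% of the rows rounds to 0, take at least one row.
Because processing both columns will produce duplicates (for example, Source = "college" yields at least one row that also appears for Target = "scandal"), duplicate entries should be removed if possible. If not, it's no big deal.
Calculation for "Source" (with x = 20%):
6 rows where Source = "#003", 6 * 0.2 = 1.2 = take 1 row
8 rows where Source = "college", 8 * 0.2 = 1.6 = take 2 rows
Desired result table for "Source":
Source   Target       Weight
#003     blitzkrank   0.83
college  scandal      1.15
college  prosecutors  0.82
How can I do this in SQL in an SQLite database?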
If you want the sample by source:
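select t.*
from (select t.*,
             -- rank rows within each source: highest weight first, ties broken alphabetically by target
             row_number() over (partition by source order by weight desc, target) as seqnum,
             -- number of rows for this source
             count(*) over (partition by source) as cnt
      from tags t
     ) t
where seqnum = 1 or -- always at least one row
      seqnum <= round(cnt * 0.2);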
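The cutoff in the where clause amounts to keeping max(1, round(n * 0.2)) rows for a group of n rows. A minimal sketch of that arithmetic, for checking a few group sizes by hand; `rows_to_take` is a hypothetical helper name, and rounding half up is an assumption chosen to match the 1.2 -> 1 and 1.6 -> 2 cases in the question:

```python
import math


def rows_to_take(group_size: int, fraction: float) -> int:
    """Rows kept for one group: nearest integer to group_size * fraction, never fewer than 1."""
    # Round half up for non-negative input (assumed to mirror the query's round()).
    nearest = math.floor(group_size * fraction + 0.5)
    # Mirrors the "seqnum = 1 or ..." branch: every group keeps at least one row.
    return max(1, nearest)


if __name__ == "__main__":
    for n in (6, 8, 2):
        print(n, rows_to_take(n, 0.2))
```

A group of 2 rows gives 0.4, which rounds to 0, so the minimum of one row applies.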
Based on your example, I think this is what you want. You can build a similar query for target.
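One way the two passes might be combined is sketched below, using Python's built-in sqlite3 module so it can be run against the sample rows. This is an illustration under assumptions, not part of the answer above: the helper names (`make_sample_db`, `top_by`, `top_by_source_and_target`) are made up, the target pass breaks ties alphabetically by source, and the bundled SQLite must be version 3.25 or newer for window functions. Joining the two queries with UNION rather than UNION ALL is what drops rows selected by both passes.

```python
import sqlite3

# Sample rows copied from the question's "tags" table.
SAMPLE_ROWS = [
    ("#003", "blitzkrank", 0.83), ("#003", "deutsch", 0.7),
    ("#003", "brammen", 0.57), ("#003", "butzfrauen", 0.55),
    ("#003", "solaaaa", 0.5), ("#003", "moments", 0.3),
    ("college", "scandal", 1.15), ("college", "prosecutors", 0.82),
    ("college", "students", 0.41), ("college", "usc", 0.33),
    ("college", "full house", 0.17), ("college", "friends", 0.08),
    ("college", "house", 0.5), ("college", "friend", 0.01),
]
ALLOWED_COLUMNS = {"source", "target"}


def make_sample_db():
    """Create an in-memory database holding the sample rows (illustrative helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tags (source TEXT, target TEXT, weight REAL)")
    conn.executemany("INSERT INTO tags VALUES (?, ?, ?)", SAMPLE_ROWS)
    return conn


def _top_sql(group_col, tiebreak_col):
    # Column names are interpolated into the SQL text, so only known identifiers are accepted.
    if group_col not in ALLOWED_COLUMNS or tiebreak_col not in ALLOWED_COLUMNS:
        raise ValueError("unexpected column name")
    return f"""
        SELECT source, target, weight
        FROM (SELECT source, target, weight,
                     ROW_NUMBER() OVER (PARTITION BY {group_col}
                                        ORDER BY weight DESC, {tiebreak_col}) AS seqnum,
                     COUNT(*) OVER (PARTITION BY {group_col}) AS cnt
              FROM tags) AS ranked
        WHERE seqnum = 1 OR seqnum <= ROUND(cnt * :fraction)
    """


def top_by(conn, group_col, tiebreak_col, fraction=0.2):
    """Top share of rows per distinct value of group_col, at least one row per group."""
    sql = _top_sql(group_col, tiebreak_col) + " ORDER BY source, weight DESC, target"
    return conn.execute(sql, {"fraction": fraction}).fetchall()


def top_by_source_and_target(conn, fraction=0.2):
    """Rows selected by the source pass or the target pass, without duplicates."""
    sql = (_top_sql("source", "target") + " UNION " + _top_sql("target", "source")
           + " ORDER BY source, weight DESC, target")
    return conn.execute(sql, {"fraction": fraction}).fetchall()


if __name__ == "__main__":
    db = make_sample_db()
    for row in top_by(db, "source", "target"):
        print(row)
```

On the sample data the source pass returns the three rows from the desired result table. Every target in the sample occurs only once, so the target pass keeps all 14 rows and the combined result is simply the whole table; the cutoff only becomes visible for targets that appear under several sources.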