Find similar entries in SQL column and rank by frequency

Question

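I have a column of 10k URIs in my SQLite database. I would like to determine which of these URIs are subdomains of the same website.

For example, for the given set...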

 1. daiquiri.rum.cu
 2. mojito.rum.cu
 3. cubalibre.rum.cu
 4. americano.campari.it
 5. negroni.campari.it
 6. hemingway.com

...I would like to run a query that returns:

Website       | Occurrences
----------------------------
rum.cu        |     3
campari.it    |     2
hemingway.com |     1

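That is, the matched domains/patterns, ordered by the number of times they appear in the database.

The heuristic I would use is: for each URI with 3 or more domain labels, replace the first label with a '%' and execute the pseudoquery: COUNT(uris from website where uris LIKE '%.remainderofmyuri').

Note that I do not care much about execution speed (in fact, not at all). The number of entries is in the 10k-100k range.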
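The heuristic above can be sketched for a single URI as follows. This is only an illustration in Python's built-in sqlite3 module; the table name website and column name uri are taken from the pseudoquery, and the helper count_siblings is a hypothetical name, not something from the question.

```python
import sqlite3

# In-memory database holding the sample URIs from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE website (uri TEXT)")  # names assumed from the pseudoquery
sample = ["daiquiri.rum.cu", "mojito.rum.cu", "cubalibre.rum.cu",
          "americano.campari.it", "negroni.campari.it", "hemingway.com"]
conn.executemany("INSERT INTO website VALUES (?)", [(u,) for u in sample])

def count_siblings(conn, uri):
    """Hypothetical helper: drop the first label and count LIKE matches."""
    labels = uri.split(".")
    if len(labels) < 3:
        return None  # the heuristic only covers URIs with 3 or more labels
    remainder = ".".join(labels[1:])
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM website WHERE uri LIKE ?", ("%." + remainder,)
    ).fetchone()
    return remainder, n

print(count_siblings(conn, "mojito.rum.cu"))  # ('rum.cu', 3)
```

A two-label URI such as hemingway.com is skipped by this heuristic, which is why the answers below treat that case separately.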

Answer 1

select x.site, count(*)
from mytable a
inner join 
(
    select 'rum.cu' as site
    union all select 'campari.it'
    union all select 'hemingway.com'
) x on a.url like '%' + x.site + '%'
group by x.site -- EDIT I missed out the GROUP BY on the first go - sorry!

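(This is how I would do it in SQL Server; I am not sure how SQLite differs syntactically.)

'mytable' is your table, which has a column named url containing 'mojito.rum.cu' and so on. I did not put in '%.' or similar, because that would miss hemingway.com. However, you could use this line instead to get around that: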

) x on a.url like '%.' + x.site + '%' or a.url = x.site

You may not need the final + '%'. I put it in to catch URLs like 'hemingway.com/some-page.html'. If you have no such URLs, you can skip it.
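For SQLite specifically, the string concatenation operator is || rather than SQL Server's +. A minimal sketch of the same join adapted for SQLite, run through Python's sqlite3 module on the sample data, might look like this (mytable and url are the names used above; the ORDER BY is an addition to match the ranking the question asks for):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (url TEXT)")
sample = ["daiquiri.rum.cu", "mojito.rum.cu", "cubalibre.rum.cu",
          "americano.campari.it", "negroni.campari.it", "hemingway.com"]
conn.executemany("INSERT INTO mytable VALUES (?)", [(u,) for u in sample])

# Same join as in the answer, with || instead of + and the '%.' variant.
QUERY = """
SELECT x.site, COUNT(*) AS occurrences
FROM mytable a
INNER JOIN (
    SELECT 'rum.cu' AS site
    UNION ALL SELECT 'campari.it'
    UNION ALL SELECT 'hemingway.com'
) x ON a.url LIKE '%.' || x.site || '%' OR a.url = x.site
GROUP BY x.site
ORDER BY occurrences DESC
"""

rows = conn.execute(QUERY).fetchall()
for site, occurrences in rows:
    print(site, occurrences)
```

On the six sample URIs this yields rum.cu with 3, campari.it with 2 and hemingway.com with 1.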

Edit for dynamic names

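select x.site, count(*)
from mytable a
inner join
(
    -- three or more labels: drop everything up to and including the first dot
    select distinct substr(url, instr(url, '.') + 1) as site
    from mytable
    where url like '%.%.%'
    union
    -- exactly two labels: the url is already the site
    select distinct url
    from mytable
    where url like '%.%' and url not like '%.%.%'
) x on a.url like '%' || x.site || '%' -- SQLite concatenates with ||, not +
group by x.site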

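Something like that should do it. I have not tested whether the INSTR() function is right. You may need to add or subtract 1 from the offset it produces when you test it. It may not be the fastest query, but it should work.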
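One quick way to settle the offset question is to evaluate the expression on a single value. The sketch below assumes SQLite's substr() and instr() (both 1-based) and uses a hypothetical helper name, tail_from, purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def tail_from(url, offset):
    """Return substr(url, instr(url, '.') + offset) as computed by SQLite."""
    (tail,) = conn.execute(
        "SELECT substr(?, instr(?, '.') + ?)", (url, url, offset)
    ).fetchone()
    return tail

with_dot = tail_from("mojito.rum.cu", 0)     # starts at the first dot itself
without_dot = tail_from("mojito.rum.cu", 1)  # starts just after the first dot
print(with_dot, without_dot)  # .rum.cu rum.cu
```

So adding 1 to the result of instr() gives the site without its leading dot.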

Answer 2

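The only problem is finding the domain. To find an algorithm, imagine your URLs had an additional dot in front (as in '.negroni.campari.it' and '.hemingway.com'). You see, the domain is always the string after the second dot from the right. All we would have to do is look for that occurrence and strip off the part before it. Unfortunately, however, SQLite's string functions are rather poor. There is no function that gives you the second occurrence of a dot, not even when counting from the left. So the algorithm is fine for most DBMS, but it does not work for SQLite. We need another approach. (I am writing this anyway, to show how the problem would usually be solved.)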
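To make the "second dot from the right" idea concrete, here is a small sketch in Python rather than SQL. The function name domain_of is hypothetical, and the sketch assumes every host contains at least one dot:

```python
from collections import Counter

def domain_of(host):
    """Return the part after the second dot from the right of '.' + host."""
    dotted = "." + host                       # the imagined extra leading dot
    last = dotted.rfind(".")                  # rightmost dot
    second_last = dotted.rfind(".", 0, last)  # second dot from the right
    return dotted[second_last + 1:]

sample = ["daiquiri.rum.cu", "mojito.rum.cu", "cubalibre.rum.cu",
          "americano.campari.it", "negroni.campari.it", "hemingway.com"]

# Group by domain and rank by frequency, highest first.
ranking = Counter(domain_of(h) for h in sample).most_common()
print(ranking)  # [('rum.cu', 3), ('campari.it', 2), ('hemingway.com', 1)]
```

Thanks to the extra leading dot, a bare domain such as hemingway.com needs no special case.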

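Here is the SQLite solution: the difference between a domain and a subdomain is that a domain contains exactly one dot, while a subdomain contains at least two. So whenever there is more than one dot, we must remove the first part, including the first dot, to get to the domain. Moreover, we want this to work even for subdomains like abc.def.geh.ijk.com, so we must do this recursively.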

with recursive cte(uri) as 
(
  select uri from uris
  union all
  select substr(uri, instr(uri, '.') + 1) as uri from cte where instr(uri, '.') > 0
)
select uri, count(*)
from cte
where length(uri) = length(replace(uri,'.','')) + 1 -- domains only
group by uri
order by count(*) desc;

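Here we generate 'daiquiri.rum.cu', 'rum.cu' and 'cu' from 'daiquiri.rum.cu', and likewise for the other rows. So for each URI we get the domain (here 'rum.cu') and some other strings. At the end we filter with LENGTH to keep only the strings that contain exactly one dot: the domains. The rest is grouping and counting.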
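To see the intermediate rows for yourself, the recursive CTE can be run from Python's sqlite3 module. This is only a sketch: the helper run is a hypothetical name, and it assumes an SQLite build recent enough to support WITH RECURSIVE and instr().

```python
import sqlite3

# The recursive CTE from the answer, kept separate so different SELECTs can follow it.
CTE = """
WITH RECURSIVE cte(uri) AS (
  SELECT uri FROM uris
  UNION ALL
  SELECT substr(uri, instr(uri, '.') + 1) FROM cte WHERE instr(uri, '.') > 0
)
"""

def run(uris, tail_sql):
    """Hypothetical helper: load uris into a fresh table and run CTE + tail_sql."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE uris (uri TEXT)")
    conn.executemany("INSERT INTO uris VALUES (?)", [(u,) for u in uris])
    return conn.execute(CTE + tail_sql).fetchall()

# Every string the recursion produces for one deeply nested URI, longest first.
steps = sorted((r[0] for r in run(["abc.def.geh.ijk.com"], "SELECT uri FROM cte")),
               key=len, reverse=True)

sample = ["daiquiri.rum.cu", "mojito.rum.cu", "cubalibre.rum.cu",
          "americano.campari.it", "negroni.campari.it", "hemingway.com"]
ranked = run(sample, """
    SELECT uri, COUNT(*) FROM cte
    WHERE length(uri) = length(replace(uri, '.', '')) + 1  -- exactly one dot
    GROUP BY uri
    ORDER BY COUNT(*) DESC
""")
print(steps)
print(ranked)
```

Of the five strings generated for abc.def.geh.ijk.com, only ijk.com has exactly one dot and survives the filter.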

Here is the SQL fiddle: http://sqlfiddle.com/#!5/c1f35/37.
