Extract the TLD from a domain name and group rows by TLD

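I need to extract the TLD from a web address and, if it matches one of a predefined set of TLDs (com, edu, nz, au), count it under that TLD. If it does not match any predefined TLD, it should be counted under "Other". If a business has no web address, it should be counted under "Not Available".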

Expected output:

CLIENT TYPE     TOTAL
-------------  ----------
com             4
au              5
nz              0
Not Available   0
Other           0

I wrote the following query, but it does not return the rows whose count is 0.

select tld2, NVL(cnt,0)  from (select REGEXP_SUBSTR (webaddress, '\.([a-z]+)(/|$)', 1, 1, NULL, 1) as tld2, count(*) cnt from client group by REGEXP_SUBSTR (webaddress, '\.([a-z]+)(/|$)', 1, 1, NULL, 1))a where tld2 in ('com','edu','gov','org')
UNION ALL
select 'Not Available' as tld2, COUNT(webaddress) from client where webaddress is null
UNION
select 'Other' as tld2, NVL(cnt,0)  from (select REGEXP_SUBSTR (webaddress, '\.([a-z]+)(/|$)', 1, 1, NULL, 1) as tld2, count(*) cnt from client group by REGEXP_SUBSTR (webaddress, '\.([a-z]+)(/|$)', 1, 1, NULL, 1))a where tld2 not in ('com','edu','gov','org');

Can someone tell me whether I should be using CASE here?

Take a look at this example:

with sub as (
select case when webaddress is null then 'Not Available' 
             when domain_name  in ('com','edu','gov','org') then domain_name else 'Other' end client_type 
             from (
SELECT 
    regexp_substr(webaddress, '\.([a-z]+)(/|$)', 1, 1, NULL,
                  1) domain_name,
    webaddress
FROM
    (
        SELECT
            'https://example.com/questions/65096217/' webaddress -- placeholder URL: the original literal was lost
        FROM
            dual
        UNION ALL
        SELECT
            'https://Whosebug.edu/questions/65096217/' 
        FROM
            dual
        UNION ALL
        SELECT
            'https://Whosebug.edu/questions/6509621/' 
        FROM
            dual
        UNION ALL
        select 'https://Whosebug.de/questions/65096217/' 
        from dual
        /*UNION ALL
        select null 
        from dual*/
    ))),
cat as (select regexp_substr('Not Available,com,edu,gov,org,Other','[^,]+', 1, level ) val
from dual
connect by regexp_substr('Not Available,com,edu,gov,org,Other', '[^,]+', 1, level) is not null)
select c.val, sum(case when s.client_type is null then 0 else 1 end)
from sub s right outer join cat c on (c.val = s.client_type)
group by c.val;

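Previous (incomplete) solution: a very simple approach using a plain GROUP BY and a CASE expression. Note that it leaves out categories that have no matching rows. It could look something like this: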

select  case when s.webaddress is null then 'Not Available' 
             when s.domain_name  in ('com','edu','gov','org') then s.domain_name else 'Other' end client_type, 
             count(*)
        from (SELECT
    regexp_substr(webaddress, '\.([a-z]+)(/|$)', 1, 1, NULL,
                  1) domain_name,
    webaddress
FROM
    (
        SELECT
            'https://example.com/questions/65096217/' webaddress -- placeholder URL: the original literal was lost
        FROM
            dual
        UNION ALL
        SELECT
            'https://Whosebug.edu/questions/65096217/' 
        FROM
            dual
        UNION ALL
        SELECT
            'https://Whosebug.edu/questions/6509621/' 
        FROM
            dual
        UNION ALL
        SELECT
            NULL 
        FROM
            dual
        UNION ALL
        select 'https://Whosebug.de/questions/65096217/' 
        from dual
    )) s
        group by case when s.webaddress is null then 'Not Available' 
             when s.domain_name  in ('com','edu','gov','org') then s.domain_name else 'Other' end
;

Please try your own approach with a few modifications:

select tld2, NVL(cnt,0)  from (select REGEXP_SUBSTR (webaddress, '\.([a-z]+)(/|$)', 1, 1, NULL, 1) as tld2, count(*) cnt from client group by REGEXP_SUBSTR (webaddress, '\.([a-z]+)(/|$)', 1, 1, NULL, 1))a where tld2 in ('com','edu','gov','org')
UNION ALL
-- COUNT(*) rather than COUNT(webaddress): COUNT of a column that is always NULL is 0
select 'Not Available' as tld2, cnt from (select COUNT(*) cnt from client where webaddress is null)
UNION
select 'Other' as tld2, cnt  from (select count(webaddress) cnt from client where REGEXP_SUBSTR (webaddress, '\.([a-z]+)(/|$)', 1, 1, NULL, 1) not in ('com','edu','gov','org'))a ;

Thanks.

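You can create a Java function to find the TLD. Your regular expression does not handle URLs that contain a port number, nor other possible edge cases such as https://localhost/not/at/example.com/, so it is better to use an API that is designed to parse URIs: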

CREATE AND COMPILE JAVA SOURCE NAMED URIHandler AS
import java.net.URI;
import java.net.URISyntaxException;

public class URIHandler {
  public static String getTLD( final String url )
  {
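    String domain = null;
    if ( url == null )
    {
      // new URI( null ) throws NullPointerException, so treat a NULL input as "no TLD"
      return null;
    }
    try
    {
      URI uri = new URI( url );
      domain = uri.getHost();
    }
    catch ( URISyntaxException ex )
    {
      // Not a parsable URI: leave domain as null so the caller gets NULL back
    }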
    if ( domain == null )
    {
        return null;
    }
    int index = domain.lastIndexOf( "." );
    return ( index >= 0 ? domain.substring( index + 1 ) : domain );
  }
}
/
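As an aside, the behaviour of the method can be checked outside the database. The following is a minimal standalone sketch, assuming the same logic as getTLD above copied into a hypothetical class named TldDemo (that class name and the main method are illustrative only and are not needed for the Oracle solution):

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class: mirrors the getTLD logic so it can be run with a plain JVM.
class TldDemo {
  static String getTLD(final String url) {
    if (url == null) {
      return null; // no address at all
    }
    String host;
    try {
      host = new URI(url).getHost();
    } catch (URISyntaxException ex) {
      return null; // not a parsable URI, e.g. it contains spaces
    }
    if (host == null) {
      return null; // parsable, but without a host part (e.g. no scheme)
    }
    int index = host.lastIndexOf('.');
    // Everything after the last dot of the host; the whole host if there is no dot.
    return index >= 0 ? host.substring(index + 1) : host;
  }

  public static void main(String[] args) {
    String[] samples = {
      "http://example.com:80/",                 // port number is not part of the host
      "https://example.nz/not/at/example.com/", // path is ignored
      "https://localhost/not/at/example.com/",  // host without a dot
      "not a URI"                               // rejected by the URI parser
    };
    for (String s : samples) {
      System.out.println(s + " -> " + getTLD(s));
    }
  }
}
```

With these inputs the method should return com, nz, localhost and null respectively, which is why the port and path cases no longer confuse the grouping.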

Then you can wrap URIHandler.getTLD in a PL/SQL function:

CREATE FUNCTION getTLD( url IN VARCHAR2 ) RETURN VARCHAR2
AS LANGUAGE JAVA NAME 'URIHandler.getTLD( java.lang.String ) return java.lang.String';
/

Then you can use this query:

WITH tlds ( tld ) AS (
  SELECT 'Not Available' FROM DUAL UNION ALL
  SELECT 'com'           FROM DUAL UNION ALL
  SELECT 'nz'            FROM DUAL UNION ALL
  SELECT 'au'            FROM DUAL UNION ALL
  SELECT 'Other'         FROM DUAL
),
matches ( match ) AS (
  SELECT DECODE(
           getTLD( url ),
           NULL,  'Not Available',
           'com', 'com',
           'au',  'au',
           'nz',  'nz',
                  'Other'
         )
  FROM   table_name
)
SELECT t.tld,
       COUNT( m.match )
FROM   tlds t
       LEFT OUTER JOIN matches m
       ON ( t.tld = m.match )
GROUP BY
       t.tld;

Which, for the sample data:

CREATE TABLE table_name ( url ) AS
SELECT 'http://example.com'      FROM DUAL UNION ALL
SELECT 'http://example.com:80/'  FROM DUAL UNION ALL
SELECT 'https://example.au'      FROM DUAL UNION ALL
SELECT 'https://example.au:442/' FROM DUAL UNION ALL
SELECT 'https://example.nz/not/at/example.com/' FROM DUAL UNION ALL
--SELECT 'https://example.net'     FROM DUAL UNION ALL
SELECT 'not a URI' FROM DUAL;

Outputs:

TLD           | COUNT(M.MATCH)
:------------ | -------------:
Other         |              0
com           |              2
nz            |              1
au            |              2
Not Available |              1
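The zero-count rows appear because the query starts from the full list of categories and outer-joins the matches onto it, instead of grouping only the rows that exist. The same idea can be sketched outside SQL. The class TldCounter and its count method below are hypothetical names used purely for illustration: every category is seeded with 0 first, and then each extracted TLD increments one bucket.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts TLDs per category, keeping categories that have no rows.
class TldCounter {
  static Map<String, Integer> count(List<String> tlds, List<String> known) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    // Seed every category with 0 (the role played by the "tlds" subquery in the SQL).
    counts.put("Not Available", 0);
    for (String k : known) {
      counts.put(k, 0);
    }
    counts.put("Other", 0);
    // Classify each value (the role played by DECODE) and increment its bucket.
    for (String tld : tlds) {
      String key;
      if (tld == null) {
        key = "Not Available";
      } else if (known.contains(tld)) {
        key = tld;
      } else {
        key = "Other";
      }
      counts.merge(key, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // TLDs as they would be extracted from the six sample rows above.
    List<String> extracted = Arrays.asList("com", "com", "au", "au", "nz", null);
    List<String> known = Arrays.asList("com", "nz", "au");
    System.out.println(count(extracted, known));
  }
}
```

A plain GROUP BY corresponds to skipping the seeding step, in which case a category such as Other never gets a row at all.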
