Filter with multivalue columns in Redshift/SQL

Question

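I have a table of news articles. These articles have many columns describing the title, image, and so on. Some columns can hold multiple values; for example, the category can be set to both "sports" and "hockey".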

Say I have this table:

articlekey | category
---------------------
article1   | sports, hockey

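The actual table contains many articles, and each article appears exactly once. What I am trying to achieve is to filter this table on either of the category values. To be able to do that, I split them into rows and generated a filter table like this: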

articlekey | category
---------------------
article1   | sports
article1   | hockey

(By the way, we use Tableau as our visualization/BI tool, and that is where I filter.)

When I join these tables and filter (include) on "hockey" only, I get the correct result, because article1 has exactly one row with the category set to "hockey".

articlekey | category         | category-filter
-----------------------------------------------
article1   | sports, hockey   | sports          <-- this will be excluded
article1   | sports, hockey   | hockey          <-- this is included

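However, if I try to exclude "hockey", the article still shows up with the category set to "sports", because the "sports" row remains in the result. I want the article to be excluded from the result entirely.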

articlekey | category         | category-filter
-----------------------------------------------
article1   | sports, hockey   | sports          <-- this is included, but should also be gone
article1   | sports, hockey   | hockey          <-- this will be excluded

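How should I handle data like this, if it is possible at all, when I have multiple values per column and need to filter (both include and exclude) so that only one row per article remains?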

Answer 1

I. If the category data is 'normalized', i.e. there are no multiple values in the category field (as in your 'filter-table'):

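I think the preferred way to solve this is to substitute 1 for 'hockey' and 0 for everything else, then sum these numbers within each articlekey group. The article keys whose sum is 0 are the articles that do not have the 'hockey' category.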

So this is the query for the articles that do not have the 'hockey' category:

select articlekey
from articles 
group by articlekey 
having sum(case when category = 'hockey' then 1 else 0 end) = 0;
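
If you want to try the conditional sum without a Redshift cluster at hand, here is a minimal sketch that runs the same query through Python's built-in sqlite3 module. SQLite only stands in for Redshift here, and the sample rows and the helper name articles_without are my own assumptions, not part of your schema.

```python
import sqlite3

# Hypothetical sample data: one row per (articlekey, category) pair.
ROWS = [
    ("article1", "sports"), ("article1", "hockey"),
    ("article2", "sports"),
    ("article3", "soccer"), ("article3", "boxing"), ("article3", "sprint"),
    ("article4", "soccer"), ("article4", "sprint"),
]


def articles_without(category):
    """Return the article keys that have no row with the given category."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for Redshift
    conn.execute("create table articles (articlekey text, category text)")
    conn.executemany("insert into articles values (?, ?)", ROWS)
    query = """
        select articlekey
        from articles
        group by articlekey
        having sum(case when category = ? then 1 else 0 end) = 0
        order by articlekey
    """
    keys = [row[0] for row in conn.execute(query, (category,))]
    conn.close()
    return keys


print(articles_without("hockey"))  # ['article2', 'article3', 'article4']
```

article1 drops out completely, because its 'hockey' row makes the sum for the whole group non-zero.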

You can generalize this: for example, if you need the articles that have neither 'hockey' nor 'sports' but do have both 'soccer' and 'boxing':

select articlekey
from articles 
group by articlekey 
having sum(
  case when category = 'hockey' then 1
       when category = 'sports' then 1
       else 0 
  end
) = 0
and sum(
  case when category = 'soccer' then 1
       when category = 'boxing' then 1
       else 0 
  end
) = 2;
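
One caveat with the "= 2" test: it counts rows, so it relies on each (articlekey, category) pair appearing only once. If duplicates are possible, a variant with count(distinct ...) may be safer. The sketch below compares both forms through Python's sqlite3 (standing in for Redshift); the rows, including the duplicated 'soccer' on article4, are made up for illustration.

```python
import sqlite3

# Hypothetical data; note that article4 carries 'soccer' twice.
ROWS = [
    ("article3", "soccer"), ("article3", "boxing"), ("article3", "sprint"),
    ("article4", "soccer"), ("article4", "soccer"), ("article4", "sprint"),
    ("article5", "soccer"), ("article5", "boxing"), ("article5", "hockey"),
]

# Row-counting form, as in the query above.
SUM_QUERY = """
    select articlekey from articles
    group by articlekey
    having sum(case when category in ('hockey', 'sports') then 1 else 0 end) = 0
       and sum(case when category in ('soccer', 'boxing') then 1 else 0 end) = 2
    order by articlekey
"""

# Variant that counts distinct matching categories instead of rows.
DISTINCT_QUERY = """
    select articlekey from articles
    group by articlekey
    having sum(case when category in ('hockey', 'sports') then 1 else 0 end) = 0
       and count(distinct case when category in ('soccer', 'boxing')
                               then category end) = 2
    order by articlekey
"""


def run(query):
    """Load the sample rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table articles (articlekey text, category text)")
    conn.executemany("insert into articles values (?, ?)", ROWS)
    keys = [row[0] for row in conn.execute(query)]
    conn.close()
    return keys


sum_result = run(SUM_QUERY)            # article4 slips in: 'soccer' counted twice
distinct_result = run(DISTINCT_QUERY)  # only article3 has both categories
print(sum_result, distinct_result)
```

If your filter table is guaranteed to be free of duplicates, the two forms return the same articles and the simpler sum is fine.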

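Alternatively, you can (1) filter on the category ('hockey'), (2) group by articlekey, (3) count the matches, and (4) left join that result back to the articles, keeping only the rows that found no match.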

So here is another solution:

select * from articles
left join (
  select articlekey, count(articlekey) as countOfHockey
  from articles
  where category = 'hockey'
  group by articlekey
) hhh on articles.articlekey = hhh.articlekey
where countOfHockey is null;

Sql fiddle: http://sqlfiddle.com/#!17/27ae1/33
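
The same exclusion can also be written as an anti-join with NOT EXISTS, which is a different technique from the left join above but should return the same rows. Here is a small sketch that checks the two against each other using Python's sqlite3 (standing in for Redshift); the sample rows are assumptions of mine.

```python
import sqlite3

# Hypothetical sample data: one row per (articlekey, category) pair.
ROWS = [
    ("article1", "sports"), ("article1", "hockey"),
    ("article2", "sports"),
    ("article3", "soccer"), ("article3", "boxing"), ("article3", "sprint"),
    ("article4", "soccer"), ("article4", "sprint"),
]

# Left join against the per-article hockey count, keeping unmatched rows.
LEFT_JOIN_QUERY = """
    select articles.articlekey, articles.category
    from articles
    left join (
        select articlekey, count(articlekey) as countOfHockey
        from articles
        where category = 'hockey'
        group by articlekey
    ) hhh on articles.articlekey = hhh.articlekey
    where countOfHockey is null
    order by articles.articlekey, articles.category
"""

# Anti-join: keep a row only if its article has no 'hockey' row at all.
NOT_EXISTS_QUERY = """
    select a.articlekey, a.category
    from articles a
    where not exists (
        select 1 from articles h
        where h.articlekey = a.articlekey and h.category = 'hockey'
    )
    order by a.articlekey, a.category
"""


def run(query):
    """Load the sample rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table articles (articlekey text, category text)")
    conn.executemany("insert into articles values (?, ?)", ROWS)
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


left_join_rows = run(LEFT_JOIN_QUERY)
not_exists_rows = run(NOT_EXISTS_QUERY)
print(left_join_rows == not_exists_rows)  # True
```

Which form performs better on Redshift depends on your data and distribution keys, so I would compare the query plans rather than assume.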

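II. If you have a denormalized category field, i.e. the categories stored as a comma-separated list of values (as in your original table), you can use the SQL LIKE operator with % wildcards on it and write queries like these: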

create table if not exists articles(articlekey varchar, category varchar);
insert into articles values('article1', 'sports, hockey');
insert into articles values('article2', 'sports');
insert into articles values('article3', 'soccer, boxing, sprint');
insert into articles values('article4', 'soccer, sprint');

select * from articles where ', '||category||',' not like '%, hockey,%';
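
The ', ' and ',' padding is there so that every value in the list, including the first and the last, is wrapped in the same delimiters and the pattern matches whole values only. A rough sketch of the difference against a plain '%hockey%' pattern, run through Python's sqlite3 (standing in for Redshift); article5 and article6 are hypothetical rows I added to show the edge cases.

```python
import sqlite3

ROWS = [
    ("article1", "sports, hockey"),
    ("article2", "sports"),
    ("article5", "sports, ice hockey"),  # hypothetical: a different category containing 'hockey'
    ("article6", "hockey"),              # hypothetical: a single-value list
]

# Plain substring match: also hits 'ice hockey'.
NAIVE_QUERY = """
    select articlekey from articles
    where category not like '%hockey%'
    order by articlekey
"""

# Padded match: only hits the exact list element 'hockey'.
PADDED_QUERY = """
    select articlekey from articles
    where ', '||category||',' not like '%, hockey,%'
    order by articlekey
"""


def run(query):
    """Load the sample rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table articles (articlekey text, category text)")
    conn.executemany("insert into articles values (?, ?)", ROWS)
    keys = [row[0] for row in conn.execute(query)]
    conn.close()
    return keys


naive_result = run(NAIVE_QUERY)    # ['article2']
padded_result = run(PADDED_QUERY)  # ['article2', 'article5']
print(naive_result, padded_result)
```

Note that this only works if the lists are stored consistently with ', ' (comma plus one space) between values.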

You can generalize this as well, if you need the articles that have neither 'hockey' nor 'sports' but do have both 'soccer' and 'boxing':

select * from articles where 
', '||category||',' not like '%, hockey,%' and
', '||category||',' not like '%, sports,%' and
', '||category||',' like '%, soccer,%' and
', '||category||',' like '%, boxing,%';

Note, however, that this approach is generally not the preferred way of handling data in a relational database.
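
If you would rather move to the normalized layout from part I, the lists have to be split into one row per category first. Below is a minimal sketch of that step done on the client side in Python, followed by the conditional-sum query from part I. It uses sqlite3 in place of Redshift, and the table name article_categories and the helper normalize are hypothetical.

```python
import sqlite3

# The denormalized sample rows from the insert statements above.
DENORMALIZED = [
    ("article1", "sports, hockey"),
    ("article2", "sports"),
    ("article3", "soccer, boxing, sprint"),
    ("article4", "soccer, sprint"),
]


def normalize(rows):
    """Split each comma-separated list into (articlekey, category) pairs."""
    pairs = []
    for articlekey, categories in rows:
        for category in categories.split(","):
            category = category.strip()
            if category:  # skip empty entries such as a trailing comma
                pairs.append((articlekey, category))
    return pairs


conn = sqlite3.connect(":memory:")  # SQLite stands in for Redshift
conn.execute("create table article_categories (articlekey text, category text)")
conn.executemany("insert into article_categories values (?, ?)",
                 normalize(DENORMALIZED))

# Same conditional-sum filter as in part I, now on the normalized table.
no_hockey = [row[0] for row in conn.execute("""
    select articlekey
    from article_categories
    group by articlekey
    having sum(case when category = 'hockey' then 1 else 0 end) = 0
    order by articlekey
""")]
conn.close()
print(no_hockey)  # ['article2', 'article3', 'article4']
```

Once the categories live in their own table, both the include and the exclude filters reduce to the group-by queries from part I.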
