Redshift POSIX regex order does not matter

Question

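I am querying data from AWS Redshift using POSIX regular expressions. However, I am having trouble matching a whole string by looking for multiple words regardless of their order.

The table looks like this: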

ID  | full_term 
123 | juice apple farm
123 | apple juice original
123 | banana juice

For example, I want to find every string that contains both apple and juice, so I expect to get the first two rows. My current query:

SELECT full_term FROM data_table
WHERE full_term ~ '(.*apple+)(.*juice+).*$'

However, order matters with this approach. I also tried full_term ~ '(?=.*apple+)(?=.*juice+).*$', but I got the error message [Amazon](500310) Invalid operation: Invalid preceding regular expression prior to repetition operator. The error occurred while parsing the regular expression fragment: '(?>>>HERE>>>=.*apple+)'. I have just realized that ?= does not work in Redshift.

Is a UDF the only solution in this case? Also, I only want apple and juice as whole words. That is, pineapple should not be matched.

Answer 1

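This is probably written most clearly as separate regular expression matches combined with AND. To make sure you do not match, for example, pineapple when looking for apple, you need to check that each side of the search term is either a whitespace character or the beginning/end of the line: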

SELECT full_term FROM data_table
WHERE full_term ~ '(^|\s)apple(\s|$)'
  AND full_term ~ '(^|\s)juice(\s|$)'
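
To try the idea without touching the real table, here is a minimal self-contained sketch. The CTE name sample_terms and the extra 'pineapple juice' row are made up for illustration and are not part of the original data. It also swaps \s for the POSIX bracket class [[:space:]], on the assumption that this avoids any question about backslash escaping inside the string literal:

```sql
-- Hypothetical sample rows mirroring the table in the question,
-- plus one 'pineapple juice' row to exercise the whole-word check.
WITH sample_terms AS (
    SELECT 'juice apple farm' AS full_term
    UNION ALL SELECT 'apple juice original'
    UNION ALL SELECT 'banana juice'
    UNION ALL SELECT 'pineapple juice'
)
SELECT full_term
FROM sample_terms
-- Each condition is checked independently, so word order is irrelevant.
WHERE full_term ~ '(^|[[:space:]])apple([[:space:]]|$)'
  AND full_term ~ '(^|[[:space:]])juice([[:space:]]|$)';
```

Because each word gets its own condition, adding a third required word is just one more AND line, and no lookahead or UDF is needed.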

regex

posix

amazon-redshift