Regular expression in Hive

Question

I am learning simple regular expressions in Hive. I am following a tutorial, but this simple HQL statement fails with an error:

select REGEXP_EXTRACT( 'Hello, my name is Ben. Please visit' , 'Ben' )

This is the error message I get:

Wrong arguments ''Ben'': org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String) on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract@ec0c06f of class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments {Hello, my name is Ben. Please visit:java.lang.String, Ben:java.lang.String} of size 2

The same pattern works in other languages, but I want to learn how to do it in Hive. Any help would be appreciated.

Answer 1

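You need to pass a third argument, the index of the group to extract. The error message shows the two-argument overload being called; as far as I can tell, that overload defaults to group index 1, and the pattern 'Ben' has no capturing group, so the lookup fails.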

To extract the full match, use 0:

select REGEXP_EXTRACT( 'Hello, my name is Ben. Please visit' , 'Ben', 0)

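To extract the value of a capturing group, use that group's index, for example

select REGEXP_EXTRACT( 'Hello, my name is Ben. Please visit' , 'name is (\\w+)', 1)

which extracts Ben. Note the doubled backslash: the reference quoted below says a single backslash in the pattern string does not reach the regex engine.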
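Since the error message names a Java UDF class, the group index can be illustrated in plain Java. The following is a minimal sketch, not Hive's actual implementation: the class `GroupIndexDemo` and the helper `extract` are hypothetical names introduced here, and only `java.util.regex.Pattern` and `Matcher` are real APIs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIndexDemo {
    // Hypothetical helper that mimics the shape of regexp_extract(subject, pattern, index).
    // It is an illustration only, not the code Hive runs.
    static String extract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        if (!m.find()) {
            return null; // no match; Hive's behavior in this case is not covered by this thread
        }
        // group(0) is the whole match, group(n) is the n-th capturing group
        return m.group(index);
    }

    public static void main(String[] args) {
        String subject = "Hello, my name is Ben. Please visit";

        // Index 0: the whole match
        System.out.println(extract(subject, "Ben", 0));

        // Index 1: the first capturing group (Java source needs "\\w" for the regex \w)
        System.out.println(extract(subject, "name is (\\w+)", 1));

        // Index 1 with a pattern that has no capturing group is rejected by Matcher
        try {
            extract(subject, "Ben", 1);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("no such group: " + e.getMessage());
        }
    }
}
```

The last call is the interesting one: asking for group 1 when the pattern defines no groups raises an `IndexOutOfBoundsException` in Java, which is consistent with the failure reported in the question.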

See this reference:

regexp_extract(string subject, string pattern, int index)
Returns the string extracted using the pattern. For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.
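The note about predefined character classes comes down to which regex the engine finally receives. The sketch below is plain Java rather than HiveQL, and it assumes the behavior the reference describes: the string literal consumes one level of backslashes before the regex is compiled. `EscapeDemo` and `found` are hypothetical names.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Hypothetical helper: true if the regex matches anywhere in the subject.
    static boolean found(String regex, String subject) {
        return Pattern.compile(regex).matcher(subject).find();
    }

    public static void main(String[] args) {
        // The regex "s" is just the letter s. Per the quoted reference, this is
        // what the engine sees when '\s' is written in the query.
        System.out.println(found("s", "a b"));   // false: "a b" has no letter s
        System.out.println(found("s", "yes"));   // true

        // The Java literal "\\s" is the two-character regex \s (whitespace),
        // which is what '\\s' in the query is meant to produce.
        System.out.println(found("\\s", "a b")); // true: there is a space
        System.out.println(found("\\s", "yes")); // false: no whitespace
    }
}
```

The same reasoning applies to the other predefined classes such as `\w` and `\d`.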

regex

hive

hiveql