Using REGEX_EXTRACT_ALL, but on projection I'm getting "()"
I am using Cloudera QuickStart 5.4. I have a file in which every line contains data like:
323.81.303.680 - - [25/Oct/2011:01:41:00 -0500] "GET /download/download6.zip HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.19) Gecko/2010031422 Firefox/3.0.19"
In Apache Pig, I am using the following script:
A= LOAD 'weblog.txt' using TextLoader() as (line:chararray);
B= FOREACH A GENERATE
FLATTEN(REGEX_EXTRACT_ALL(line,'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] “(.+?)” (\S+) (\S+) “([^”]*)” “([^”]*)”')) AS (remoteAddr: chararray, remoteLogname: chararray, user: chararray, time:chararray, request: chararray, status:int,bytes_string:chararray,referrer: chararray, browser: chararray);
DUMP B;
The query above produces output like:
()
()
Can anyone tell me what I am doing wrong? Is the regex OK?
Add ", line" at the end, after "chararray)" and before the ";":
A= LOAD 'weblog.txt' using TextLoader() as (line:chararray);
B= FOREACH A GENERATE FLATTEN(
REGEX_EXTRACT_ALL(line,'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"'))
AS (remoteAddr: chararray, remoteLogname: chararray, user: chararray, time:chararray, request: chararray, status:int,bytes_string:chararray,referrer: chararray, browser: chararray)
, line;
DUMP B;
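As for the regex, it matches the sample string fine; see the regex demo. Note that the version above is written with straight double quotes ("), whereas the script in the question uses typographic quotes (“ ”). The sample line contains only straight quotes, so the typographic-quote version cannot match it.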
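To check the pattern outside of Pig, here is a minimal sketch in Python. It is only a stand-in: Pig evaluates the regex with Java's regex engine, not Python's re module, and the helper name extract_all is made up for this illustration. It assumes that a whole-string match is the right comparison, which is why re.fullmatch is used. It runs both spellings of the pattern against the sample line from the question.

```python
import re

# Sample log line from the question (the quoted fields use straight double quotes).
LINE = ('323.81.303.680 - - [25/Oct/2011:01:41:00 -0500] '
        '"GET /download/download6.zip HTTP/1.1" 200 0 "-" '
        '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.19) '
        'Gecko/2010031422 Firefox/3.0.19"')

# Pattern as written in the answer: straight double quotes.
STRAIGHT = (r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
            r'"(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"')

# Pattern as written in the question: typographic (curly) quotes.
CURLY = (r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
         r'“(.+?)” (\S+) (\S+) “([^”]*)” “([^”]*)”')


def extract_all(pattern, text):
    """Hypothetical helper: return all capture groups if the whole
    string matches the pattern, otherwise None."""
    match = re.fullmatch(pattern, text)
    return match.groups() if match else None


straight_groups = extract_all(STRAIGHT, LINE)
curly_groups = extract_all(CURLY, LINE)

print(straight_groups)  # a tuple of nine captured fields
print(curly_groups)     # None, since the line contains no curly quotes
```

If the straight-quote pattern yields nine fields here but the Pig script still prints empty tuples, the quote characters in the script file are a reasonable first thing to inspect.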