Apache Pig: REGEX_EXTRACT the second directory name from a string containing a path

Good afternoon,

After extracting data from the NameNode logs and filtering it, I get the following output:

DUMP USERS_AND_DIRS;

(UserName,/user/UserName/Dir1,/user/UserName/Dir2)
(UserName2,/user/UserName2/Dir1,/user/UserName2/Dir2)
(UserName,/hdfs/data/Dir1,/hdfs/data/Dir2)
(UserName,/user/UserName/Dir1,/user/UserName2/Dir2)

AS (User, Source, Destination)

Now I want to filter for users who are using directories that are not their own. I have:

ONLY_IN_USER_DIR = FILTER USERS_AND_DIRS BY (Source MATCHES '/user/(.*)') OR (Destination MATCHES '/user/(.*)');

This works fine. But this does not work:

USING_DIR_OF_OTHER_USER = FILTER ONLY_IN_USER_DIR BY NOT(Source MATCHES '/user/$User/(.*)') OR NOT (Destination MATCHES '/user/$User/(.*)');

Given the input above, I would like to get only:

(UserName,/user/UserName/Dir1,/user/UserName2/Dir2)

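That is, the users who are accessing files in directories other than their own, as either source or destination. I have tried different things, but I cannot figure out how to do it.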


EDIT: I think I could do something like:

DIR_OF_OTHER_USER = FILTER ONLY_IN_USER_DIR BY User != REGEX_EXTRACT(Source,'/(.*)/',2) OR User != REGEX_EXTRACT(Destination,'/(.*)/',2);

But when Source is "/user/UserDir/Dir1/Dir2", REGEX_EXTRACT(Source,'/(.*)/',1) returns (user) and REGEX_EXTRACT(Source,'/(.*)/',2) returns nothing.

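So I guess my question now is: how can I extract the second directory from a string containing a path? From "/user/UserDir/Dir1/Dir2", I would like to extract "UserDir".

OK, found it: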
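
One thing that can be checked outside Pig: the pattern '/(.*)/' contains a single capturing group, so asking for group 2 cannot return anything, whatever the input. Below is a minimal Python sketch of that observation. Python's `re` module stands in for the Java regex engine that Pig uses, and the helper name `second_dir` is hypothetical.

```python
import re

# The pattern from the attempt above has exactly one capturing group,
# so there is no group 2 to extract.
ATTEMPT = re.compile(r"/(.*)/")

# One capturing group per path component: group 1 is the first
# directory, group 2 is the second.
TWO_DIRS = re.compile(r"^/([^/]*)/([^/]*)")


def second_dir(path):
    """Return the second path component, or None if the path has fewer than two."""
    m = TWO_DIRS.match(path)
    return m.group(2) if m else None


if __name__ == "__main__":
    print(ATTEMPT.groups)                         # number of groups in the attempt
    print(second_dir("/user/UserDir/Dir1/Dir2"))  # second component of the path
```

Using `[^/]*` instead of `.*` keeps each group from running past the next slash.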

DIR_OF_OTHER_USER = FILTER ONLY_IN_USER_DIR BY (
                        User != REGEX_EXTRACT(Source, '^\/([^\/]*)\/([^\/]*)', 2) AND REGEX_EXTRACT(Source, '^\/([^\/]*)\/([^\/]*)', 1) == 'user'
                        ) OR (
                        User != REGEX_EXTRACT(Destination, '^\/([^\/]*)\/([^\/]*)', 2) AND REGEX_EXTRACT(Destination, '^\/([^\/]*)\/([^\/]*)', 1) == 'user'
                        );

This gets all files used by a user that are in a /user/* directory but not in that user's own directory, /user/UserName/. There may be a simpler answer, though.
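
As a sanity check of the filter logic, here is a minimal Python sketch that applies the same two-group pattern to the four sample rows from the DUMP output. It is not Pig code: the function names are hypothetical, and each path is tested against its own first component (Source against Source, Destination against Destination).

```python
import re

# Same pattern as in the Pig filter: group 1 = first directory,
# group 2 = second directory.
PATH_RE = re.compile(r"^/([^/]*)/([^/]*)")

# Sample rows from the DUMP output: (User, Source, Destination).
ROWS = [
    ("UserName", "/user/UserName/Dir1", "/user/UserName/Dir2"),
    ("UserName2", "/user/UserName2/Dir1", "/user/UserName2/Dir2"),
    ("UserName", "/hdfs/data/Dir1", "/hdfs/data/Dir2"),
    ("UserName", "/user/UserName/Dir1", "/user/UserName2/Dir2"),
]


def in_other_user_dir(user, path):
    """True if path is under /user/<someone> and <someone> is not user."""
    m = PATH_RE.match(path)
    return bool(m) and m.group(1) == "user" and m.group(2) != user


def uses_dir_of_other_user(row):
    """True if either the source or the destination belongs to another user."""
    user, source, destination = row
    return in_other_user_dir(user, source) or in_other_user_dir(user, destination)


RESULT = [row for row in ROWS if uses_dir_of_other_user(row)]

if __name__ == "__main__":
    for row in RESULT:
        print(row)
```

On the sample data only the last row survives, which matches the output asked for in the question.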

How can I extract the second directory from a string containing a path? From "/user/UserDir/Dir1/Dir2", I would like to extract "UserDir".

How about:

STRSPLIT(Source,'/').;
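
The field selector after the dot in the line above did not survive in this copy, so which element to pick is an assumption. A quick Python sketch of splitting on '/' (plain `str.split`, not Pig's STRSPLIT; the helper name is hypothetical) suggests that the leading slash produces an empty first element, which would put the second directory at index 2. Whether STRSPLIT behaves identically is worth verifying on real data.

```python
def path_components(path):
    """Split a path on '/' and keep empty strings, as a plain split does."""
    return path.split("/")


PARTS = path_components("/user/UserDir/Dir1/Dir2")

if __name__ == "__main__":
    print(PARTS)     # the first element is the empty text before the leading '/'
    print(PARTS[2])  # second directory
```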