Filter a list of paths in a shell script
I have a file ("dump_file") containing a list of paths (generated from hadoop fs -ls output), in the following format:
d hdfs 0 2021-06-01-13:14 /dir1
d hdfs 0 2021-06-01-13:14 /dir1/dir2
d hdfs 0 2021-06-01-13:14 /dir1/dir2/dir3
- abcdef 1201 2021-06-01-13:15 /dir1/dir2/dir3/file1
- abcdef 78441 2021-06-01-13:16 /dir1/dir2/dir3/file2
d hdfs 0 2021-06-01-13:14 /dir1/dir2/dir4
d hdfs 0 2021-06-01-13:14 /dir1/dir2/dir4/dir5
- abcdef 1201 2021-06-01-13:15 /dir1/dir2/dir4/file11
- abcdef 78441 2021-06-01-13:16 /dir1/dir2/dir4/file22
d hdfs 0 2021-06-01-13:14 /dir1/dir6/dir7
My goal is to extract the first-level children of any given node.
This is what I have so far (using "dir1" as an example):
grep -Eio "/dir1.[^\/]+" < dump_file | sort -u | awk -F "/" '{ print $NF }'
dir2
dir6
But I would also like the leading fields of the matching line, like this:
d hdfs 0 2021-06-01-13:14 dir2
d hdfs 0 2021-06-01-13:14 dir6
The value "dir1/dir2" should return:
d hdfs 0 2021-06-01-13:14 dir3
d hdfs 0 2021-06-01-13:14 dir4
And "dir1/dir2/dir4":
d hdfs 0 2021-06-01-13:14 dir5
- abcdef 1201 2021-06-01-13:15 file11
- abcdef 78441 2021-06-01-13:16 file22
Do you know how I could do this? Thanks!
With your shown samples, please try the following awk code. Pass the string you want to look for in the value variable of this awk program.
awk -v value="dir1" '
BEGIN{ len=length(value) }
match($0,"/"value"/[^/]*"){
  matVal=substr($0,RSTART+len+2,RLENGTH-len-2)
  if(!arr[matVal]++){
    print substr($0,1,RSTART-1) matVal
  }
}
' Input_file
Explanation: a detailed, line-by-line explanation of the above.
awk -v value="dir1" '                            ##Start the awk program and set value to the string we want to look for.
BEGIN{ len=length(value) }                       ##In the BEGIN section, store the length of value in len.
match($0,"/"value"/[^/]*"){                      ##Use match to find /value/ followed by the next path level.
  matVal=substr($0,RSTART+len+2,RLENGTH-len-2)   ##Set matVal to the child name, i.e. the match without the leading /value/.
  if(!arr[matVal]++){                            ##Continue only if this child name has not been seen before.
    print substr($0,1,RSTART-1) matVal           ##Print the part of the line before the match, followed by the child name.
  }
}
' Input_file                                     ##Name of the input file.
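To make the RSTART/RLENGTH arithmetic above concrete, here is a minimal sketch that prints the values match() sets for a single sample line. The helper name show_match is hypothetical and exists only for illustration; the awk logic is the same as in the answer.

```shell
# show_match LINE VALUE: hypothetical helper that reports what match() found.
show_match() {
    printf '%s\n' "$1" | awk -v value="$2" '
    BEGIN { len = length(value) }
    match($0, "/" value "/[^/]*") {
        # RSTART is the position of the leading "/", RLENGTH the match length.
        child = substr($0, RSTART + len + 2, RLENGTH - len - 2)
        printf "RSTART=%d RLENGTH=%d child=%s\n", RSTART, RLENGTH, child
    }'
}

show_match 'd hdfs 0 2021-06-01-13:14 /dir1/dir2/dir3' dir1
# prints: RSTART=27 RLENGTH=10 child=dir2
```

Skipping len + 2 characters steps over the leading slash, the value itself, and the slash after it, which leaves only the child name.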
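Edit: Since the OP's samples pass dir1/dir2, dir1, or dir1/dir2/dir3, and, per the comments, paths such as foo/dir1/dir2 (where the passed value only appears in a subdirectory position) should be ignored, you can try the following. Note that this will fail if your path contains regex metacharacters (I will try to fix that at some point if I can).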
awk -v value="dir1/dir2" '
BEGIN{ len=length(value) }
match($0,"[[:space:]]+/"value"/[^/]*"){
  matVal=substr($0,RSTART,RLENGTH)
  sub(/^[[:space:]]+/,"",matVal)
  sub("^/"value"/","",matVal)
  if(!arr[matVal]++){
    print substr($0,1,RSTART-1) OFS matVal
  }
}
' Input_file
This may be what you're trying to do:
$ cat tst.awk
{
    head = $0
    sub("[[:space:]]+/.*","",head)
    sub("[^/]+","")
}
index($0,"/" tgt "/") == 1 {
    $0 = substr($0,length(tgt) + 3)
    sub("/.*","")
    if ( !seen[$0]++ ) {
        print head, $0
    }
}
$ awk -v tgt='dir1' -f tst.awk file
d hdfs 0 2021-06-01-13:14 dir2
d hdfs 0 2021-06-01-13:14 dir6
$ awk -v tgt='dir1/dir2' -f tst.awk file
d hdfs 0 2021-06-01-13:14 dir3
d hdfs 0 2021-06-01-13:14 dir4
Original answer, which assumed you only wanted to specify the top-level directory rather than a path as the target:
$ cat tst.awk
{
    head = $0
    sub("[[:space:]]+/.*","",head)
    sub("[^/]+","")
    nd = split($0,dirs,"/")
}
(nd>2) && (dirs[2] == tgt) && !seen[dirs[3]]++ {
    print head, dirs[3]
}
$ awk -v tgt='dir1' -f tst.awk file
d hdfs 0 2021-06-01-13:14 dir2
d hdfs 0 2021-06-01-13:14 dir6
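The above assumes that none of the target directory names contain escape sequences, e.g. \n, since awk interprets escape sequences in values assigned with -v.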
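One way to sidestep the escape-sequence caveat is to hand the target to awk through the environment rather than with -v, because ENVIRON values are not subject to escape processing. The following is a sketch under that assumption, not part of the original answer; the function name first_level_children and the variable names path and child are made up for illustration.

```shell
# first_level_children TARGET FILE: hypothetical wrapper around the awk logic.
# TARGET is passed through the environment so backslashes are kept literally.
first_level_children() {
    tgt="$1" awk '
    BEGIN { tgt = ENVIRON["tgt"] }
    {
        head = $0
        sub("[[:space:]]+/.*", "", head)   # keep everything before the path
        path = $0
        sub("^[^/]+", "", path)            # keep the path only
    }
    index(path, "/" tgt "/") == 1 {        # literal prefix test, no regex
        child = substr(path, length(tgt) + 3)
        sub("/.*", "", child)              # cut anything below the first level
        if (!seen[child]++) print head, child
    }' "$2"
}

# Example call (assumes a listing file named dump_file exists):
#   first_level_children dir1/dir2 dump_file
```

Because the prefix test uses index() rather than a regular expression, regex metacharacters in the target are also treated literally.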
#!/usr/bin/perl -sl
$re = qr[^((?:\S+\s+){4})/\Q$dir\E/([^/]+)];
while (<>) {
chomp;
    print $1.$2 if m[$re]o and !$seen{$2}++;
}
perl above.pl -dir=dir1/dir2 file
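This is based on a perl regex with two capture groups: one captures the first four fields, and the other captures the part after "/dir1/dir2/". \Q...\E is used to escape any regex metacharacters.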
The same regex can be used with pcregrep (plus sort -u to remove duplicates):
pcregrep -o1 -o2 '^((?:\S+\s+){4})/dir1/dir2/([^/]+)' file | sort -uk5