Grep a regex pattern from a file that starts with a certain pattern
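I am trying to build a shell script that reads a file (scope.txt) with a while loop. The scope file contains website domains. The loop iterates over scope.txt and searches for each domain in another file named urls.txt. I need to grep urls.txt for a pattern and get the result shown at the end.
scope.txt contains: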
google.com
facebook.com
urls.txt contains:
https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://test.com/sdvs?url=google.com
https://abcd.com/jhhhh/hghv?proxy=https://google.com
https://a.b.c.d.facebook.com/ss/sdfsdf
http://aa.b.c.d.com/dfgdfg/sgfdfg?url=https://google.com
The output I need:
https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://a.b.c.d.facebook.com/ss/sdfsdf
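The output should contain every URL whose host is one of the domains listed in scope.txt, or a subdomain of one.
I tried to write a shell script, but it does not produce the desired output. The script: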
while read -r line; do
cat urls.txt | grep -e "^https\:\/\/$line\|^http\:\/\/$line"
done < scope.txt
You can use this grep + sed solution:
grep -Ef <(sed 's/\./\\&/g; s~^~^https?://([^.?]+\\.)*~' scope.txt) urls.txt
https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://a.b.c.d.facebook.com/ss/sdfsdf
The output of the sed command builds the proper regular expressions that we use in grep:
sed 's/\./\\&/g; s~^~^https?://([^.?]+\\.)*~' scope.txt
^https?://([^.?]+\.)*google\.com
^https?://([^.?]+\.)*facebook\.com
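The generated patterns anchor the start of the host but not its end, so a URL such as https://google.com.evil.example/x (a made-up host used only for illustration) would also match. The following is a minimal sketch of a stricter variant, not the answer's original command: it assumes the host must be followed by `/`, `:`, `?`, `#` or the end of the line, and it writes the patterns to a temporary file (the name patterns.txt is hypothetical) instead of using process substitution.

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' google.com facebook.com > "$workdir/scope.txt"
printf '%s\n' \
  'https://google.com/ukhkj/sdgdsdd/' \
  'http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf' \
  'https://test.com/sdvs?url=google.com' \
  'https://abcd.com/jhhhh/hghv?proxy=https://google.com' \
  'https://a.b.c.d.facebook.com/ss/sdfsdf' \
  'http://aa.b.c.d.com/dfgdfg/sgfdfg?url=https://google.com' \
  'https://google.com.evil.example/x' \
  > "$workdir/urls.txt"

# Build one extended regex per scope line:
#   1. drop blank lines, 2. escape every dot,
#   3. wrap the domain with the scheme, optional subdomain labels,
#      and a required end-of-host character (assumption: / : ? # or end of line).
sed '/^[[:space:]]*$/d; s/\./\\&/g; s~.*~^https?://([^./?#]+\\.)*&([/:?#]|$)~' \
  "$workdir/scope.txt" > "$workdir/patterns.txt"

# Match urls.txt against all generated patterns at once.
matches=$(grep -Ef "$workdir/patterns.txt" "$workdir/urls.txt")
printf '%s\n' "$matches"

rm -rf "$workdir"
```

On the sample data this prints the same three URLs as the original command and skips the extra google.com.evil.example line. Excluding `/` from the subdomain character class also keeps a path segment from being mistaken for part of the host.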
With your shown samples, please try the following.
awk '
FNR==NR{
  arr[$0]
  next
}
{
  for(key in arr){
    if($0~/^https?:\/\// && $0 ~ key"/"){
      print
      next
    }
  }
}
' scope urlfile
Explanation: Adding a detailed explanation for the above.
awk '                        ##Starting the awk program here.
FNR==NR{                     ##This condition is TRUE only while the first file (scope) is being read.
  arr[$0]                    ##Creating an entry in array arr indexed by the current line.
  next                       ##next skips all further statements for this line.
}
{
  for(key in arr){           ##Traversing array arr here.
    if($0~/^https?:\/\// && $0 ~ key"/"){   ##If the line starts with http:// or https:// AND contains key followed by /, do the following.
      print                  ##Printing the current line.
      next                   ##next skips all further statements for this line.
    }
  }
}
' scope urlfile              ##Mentioning the Input_file names here.
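The test `$0 ~ key"/"` treats each scope line as an unanchored regex, so a URL like https://notgoogle.com/ (a made-up host used only for illustration) would also be printed, and an in-scope URL with no slash after the host would be missed. As a hedged alternative, the sketch below pulls the host out of each URL and compares it to the scope entries as plain strings. The array name scope, the variables host and cut, and the extra sample URLs are all assumptions introduced for this example.

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' google.com facebook.com > "$workdir/scope.txt"
printf '%s\n' \
  'https://google.com/ukhkj/sdgdsdd/' \
  'http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf' \
  'https://test.com/sdvs?url=google.com' \
  'https://abcd.com/jhhhh/hghv?proxy=https://google.com' \
  'https://a.b.c.d.facebook.com/ss/sdfsdf' \
  'http://aa.b.c.d.com/dfgdfg/sgfdfg?url=https://google.com' \
  'https://notgoogle.com/' \
  'https://google.com.evil.example/x' \
  > "$workdir/urls.txt"

matches=$(awk '
FNR == NR {                      # first file: remember each non-empty scope domain
    if ($0 != "") scope[$0]
    next
}
/^https?:\/\// {                 # second file: only consider http(s) URLs
    split($0, parts, "/")        # "https://host/path" -> parts[3] is the host
    host = parts[3]
    sub(/[:?#].*/, "", host)     # drop a port, query or fragment glued to the host
    for (domain in scope) {
        cut = length(host) - length(domain)
        # in scope if the host is the domain itself or ends with ".domain"
        if (host == domain || (cut > 0 && substr(host, cut) == "." domain)) {
            print
            next
        }
    }
}
' "$workdir/scope.txt" "$workdir/urls.txt")
printf '%s\n' "$matches"

rm -rf "$workdir"
```

Because the comparison is a string suffix check rather than a regex match, dots in the scope file need no escaping. URLs that carry credentials before the host (user@host) are not handled by this sketch.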