Grep a regex pattern from file which starts with certain pattern

Question

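I am trying to build a shell script that reads a file (scope.txt) with a while loop. The scope file contains website domains. The loop iterates over scope.txt and searches for each domain in another file named urls.txt. I need to grep urls.txt for the pattern and get the result listed further down.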

Contents of the scope file:

google.com
facebook.com

Contents of the URL file:

https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://test.com/sdvs?url=google.com
https://abcd.com/jhhhh/hghv?proxy=https://google.com
https://a.b.c.d.facebook.com/ss/sdfsdf
http://aa.b.c.d.com/dfgdfg/sgfdfg?url=https://google.com

The output I need:

https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://a.b.c.d.facebook.com/ss/sdfsdf

The output should contain every URL whose host is one of the domains listed in scope.txt or a subdomain of one of them.

I tried to write a shell script, but it does not produce the desired output. Contents of the script:

while read -r line; do
cat urls.txt | grep -e "^https\:\/\/$line\|^http\:\/\/$line"
done < scope.txt

Answer 1

You can use this grep + sed solution:

grep -Ef <(sed 's/\./\\&/g; s~^~^https?://([^.?]+\\.)*~' scope.txt) urls.txt

https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://a.b.c.d.facebook.com/ss/sdfsdf

The sed command builds the regular expressions that grep then uses. Its output is:

sed 's/\./\\&/g; s~^~^https?://([^.?]+\\.)*~' scope.txt

^https?://([^.?]+\.)*google\.com
^https?://([^.?]+\.)*facebook\.com
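
To try this on the sample data from the question, here is a minimal sketch. It assumes `mktemp` is available, and it writes the generated patterns to a temporary file instead of using process substitution so that it also runs under plain `sh`. The function name `filter_in_scope` and the temporary directory are illustrative only, not part of the answer above.

```shell
#!/bin/sh
# Sketch: reproduce the grep + sed approach on the sample data from the question.
# filter_in_scope and the temporary paths are hypothetical names for illustration.

filter_in_scope() {
    # $1 = scope file (one domain per line), $2 = URL file (one URL per line)
    patterns=$(mktemp) || return 1
    # Escape literal dots, then prepend the scheme and optional-subdomain prefix.
    sed 's/\./\\&/g; s~^~^https?://([^.?]+\\.)*~' "$1" > "$patterns"
    # Print every URL that matches at least one generated pattern.
    grep -Ef "$patterns" "$2"
    status=$?
    rm -f "$patterns"
    return "$status"
}

workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT   # clean up when the script exits

printf '%s\n' 'google.com' 'facebook.com' > "$workdir/scope.txt"
printf '%s\n' \
    'https://google.com/ukhkj/sdgdsdd/' \
    'http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf' \
    'https://test.com/sdvs?url=google.com' \
    'https://abcd.com/jhhhh/hghv?proxy=https://google.com' \
    'https://a.b.c.d.facebook.com/ss/sdfsdf' \
    'http://aa.b.c.d.com/dfgdfg/sgfdfg?url=https://google.com' \
    > "$workdir/urls.txt"

filter_in_scope "$workdir/scope.txt" "$workdir/urls.txt"
```

The URLs that only mention a scope domain in the query string are rejected because `[^.?]+` cannot cross the `?`, so the subdomain prefix can never reach that far into the line.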

Answer 2

With your shown samples, please try the following.

awk '
FNR==NR{
  arr[$0]
  next
}
{
  for(key in arr){
    if($0~/^https?:\/\// && $0 ~ key"/"){
      print
      next
    }
  }
}
' scope urlfile

Explanation: a detailed explanation of the above.

awk '                  ##Starting awk program from here.
FNR==NR{               ##This condition is TRUE only while the first file (scope) is being read.
  arr[$0]              ##Creating array arr with index of current line.
  next                 ##next will skip all further statements from here.
}
{
  for(key in arr){     ##Traversing through array arr here.
    if($0~/^https?:\/\// && $0 ~ key"/"){  ##Checking whether the line starts with http:// or https:// AND contains key followed by "/"; if so, do the following.
      print            ##Printing current line here.
      next             ##next will skip all further statements from here.
    }
  }
}
' scope urlfile        ##Mentioning Input_file names here.
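
The awk code above treats each scope line as a regex and looks for it anywhere in the URL. As a hedged variation (not the code from this answer), the sketch below uses the same two-file `FNR==NR` idiom but first cuts the host out of each URL and compares it to the scope domains as plain strings. The function name `match_hosts` and the temporary paths are hypothetical, and the host extraction is deliberately simple (it does not handle userinfo such as `user@host`).

```shell
#!/bin/sh
# Sketch: host-based variant of the two-file awk idiom.
# match_hosts and the temporary paths are hypothetical names for illustration.

match_hosts() {
    # $1 = scope file (one domain per line), $2 = URL file (one URL per line)
    awk '
    FNR == NR { if ($0 != "") scope[$0]; next }   # first file: remember each domain
    $0 ~ "^https?://" {
        host = $0
        sub("^https?://", "", host)               # drop the scheme
        sub("[/?#:].*$", "", host)                # drop path, query, fragment, port
        for (d in scope) {
            n = length(host) - length(d)
            # In scope if the host equals the domain or ends with "." d
            if (host == d || (n > 0 && substr(host, n) == "." d)) {
                print
                next
            }
        }
    }
    ' "$1" "$2"
}

workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT   # clean up when the script exits

printf '%s\n' 'google.com' 'facebook.com' > "$workdir/scope.txt"
printf '%s\n' \
    'https://google.com/ukhkj/sdgdsdd/' \
    'http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf' \
    'https://test.com/sdvs?url=google.com' \
    'https://abcd.com/jhhhh/hghv?proxy=https://google.com' \
    'https://a.b.c.d.facebook.com/ss/sdfsdf' \
    'http://aa.b.c.d.com/dfgdfg/sgfdfg?url=https://google.com' \
    > "$workdir/urls.txt"

match_hosts "$workdir/scope.txt" "$workdir/urls.txt"
```

Comparing strings rather than regexes means the dots in a domain are never treated as wildcards, and a scope domain that appears only in the path or query string of a URL cannot cause a match.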
