Filter each subdomain in a list based on a unique value
I have two lists of URLs.
The first, listofdomains.txt, contains:
http://example.com
https://www.example.com
https://abc-test.example.com
The second, urls_params.txt, contains:
http://example.com/?param1=123
http://example.com/?param1=123&param2=456
https://www.example.com/?param1=123
https://www.example.com/?param1=123&param2=456
https://abc-test.example.com/?param1=123
https://abc-test.example.com/?param1=123&param2=456
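I need to loop over the two lists to pull from urls_params.txt all the URLs that belong to each subdomain, and save them in a file named after the subdomain (name.txt).
For example, the desired output would be a file named example.com that contains: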
http://example.com/?param1=123
http://example.com/?param1=123&param2=456
And so on for the rest of the subdomains.
My workaround was to reduce the listofdomains.txt list to just:
example.com
www.example.com
abc-test.example.com
and save it in a file named list.
Then I ran the following command:
while read -r url; do $(cat urls_params.txt | awk -v u="$url" '{print u}') ; done < list
But the output is an error:
example.com: command not found
www.example.com: command not found
abc-test.example.com: command not found
Thanks.
Found it:
while read -r url ; do cat urls_params.txt | grep -E "$url" | tee $url.txt ; done < list
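One caveat with the grep loop above: `grep -E "$url"` matches the pattern anywhere in the line, so `example.com` also matches the `www.example.com` and `abc-test.example.com` URLs, and each unescaped dot matches any character. Below is a minimal sketch of a stricter variant. It assumes the `list` file holds bare host names as shown in the question, and that every URL has the form scheme, `://`, host, `/`. The function name `split_by_host` is made up for illustration.

```shell
# Hypothetical helper: write each host's URLs to "<host>.txt" in the current directory.
# Args: $1 = file of bare host names, $2 = file of URLs.
split_by_host() {
    hosts_file=$1
    urls_file=$2
    while IFS= read -r host; do
        [ -n "$host" ] || continue    # skip blank lines
        # -F matches the text literally, so dots are not wildcards;
        # the leading "://" and trailing "/" pin the match to the host part.
        # "|| true" keeps a host with no matching URLs from counting as a failure.
        grep -F -e "://$host/" "$urls_file" > "$host.txt" || true
    done < "$hosts_file"
}
```

Called as `split_by_host list urls_params.txt`, this should leave only the two `http://example.com/` URLs in example.com.txt. A URL that embeds another host inside its query string would still match, so treat it as a sketch rather than a full URL parser.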
Input (from the question):
$ ls
listofdomains.txt tst.awk urls_params.txt
Script:
$ cat tst.awk
{
    dom = $0
    sub("https?://","",dom)
    sub("/.*","",dom)
}
NR==FNR {
    dom2urls[dom] = dom2urls[dom] $0 ORS
    next
}
dom != prev {
    close(out)
    out = dir "/" dom
    prev = dom
}
{ printf "%s", dom2urls[dom] > out }
Run:
$ awk -v dir="$PWD" -f tst.awk urls_params.txt listofdomains.txt
Output:
$ ls
abc-test.example.com example.com listofdomains.txt tst.awk urls_params.txt www.example.com
$ head *.com
==> abc-test.example.com <==
https://abc-test.example.com/?param1=123
https://abc-test.example.com/?param1=123&param2=456
==> example.com <==
http://example.com/?param1=123
http://example.com/?param1=123&param2=456
==> www.example.com <==
https://www.example.com/?param1=123
https://www.example.com/?param1=123&param2=456
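You don't actually need listofdomains.txt unless you want to exclude some domains from the output, or you want empty output files for domains that don't appear in urls_params.txt.
If you only want to create output files for domains that have entries in urls_params.txt (i.e. no empty output files), then just change: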
{ printf "%s", dom2urls[dom] > out }
to:
dom in dom2urls { printf "%s", dom2urls[dom] > out }
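For the case where listofdomains.txt is not needed at all, a single pass over urls_params.txt may be enough. The following is a rough sketch under that assumption, not part of the script above. The function name `split_urls_by_host` is hypothetical, and the output directory is assumed to exist already.

```shell
# Hypothetical helper: split a URL file into one file per host in a single awk pass.
# Args: $1 = file of URLs, $2 = existing output directory.
split_urls_by_host() {
    awk -v dir="$2" '
    {
        dom = $0
        sub(/^https?:\/\//, "", dom)   # drop the scheme
        sub(/\/.*/, "", dom)           # drop the path and query string
        if (dom == "") next            # skip blank lines
        out = dir "/" dom
        print >> out                   # append the whole URL line to its host file
        close(out)                     # keep the number of open files small
    }' "$1"
}
```

Usage would be something like `split_urls_by_host urls_params.txt "$PWD"`. Because it appends, running it twice into the same directory duplicates the lines, so the output files should be removed between runs.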