Mirror entire website and save links in txt file
Is it possible to use wget to mirror an entire website and save all of its links in a txt file?
If so, how is it done? If not, is there another way to do it?
EDIT:
I tried running this:
wget -r --spider example.com
and got this result:
Spider mode enabled. Check if remote file exists.
--2015-10-03 21:11:54-- http://example.com/
Resolving example.com... 93.184.216.34, 2606:2800:220:1:248:1893:25c8:1946
Connecting to example.com|93.184.216.34|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1270 (1.2K) [text/html]
Remote file exists and could contain links to other resources -- retrieving.
--2015-10-03 21:11:54-- http://example.com/
Reusing existing connection to example.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 1270 (1.2K) [text/html]
Saving to: 'example.com/index.html'
100%[=====================================================================================================>] 1,270 --.-K/s in 0s
2015-10-03 21:11:54 (93.2 MB/s) - 'example.com/index.html' saved [1270/1270]
Removing example.com/index.html.
Found no broken links.
FINISHED --2015-10-03 21:11:54--
Total wall clock time: 0.3s
Downloaded: 1 files, 1.2K in 0s (93.2 MB/s)
(Yes, I also tried using other websites with more internal links)
Yes, by using wget's --spider option. A command such as:
wget -r --spider example.com
will crawl all of the links down to a depth of 5 (the default). You can then capture the output to a file, cleaning it up as you go if needed. Something like:
wget -r --spider example.com 2>&1 | grep "http://" | cut -f 4 -d " " >> weblinks.txt
will put just the links into the weblinks.txt file (you may need to tweak that command a bit if your version of wget produces slightly different output).
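If you also want that list cleaned up afterwards, one option is a small post-processing step. Here is a minimal Python sketch (my own addition, not part of the wget command) that deduplicates and sorts the weblinks.txt file produced above:

# Minimal sketch: deduplicate and sort the links captured by the wget
# pipeline above. Assumes weblinks.txt sits in the current directory.
with open('weblinks.txt') as f:
    links = sorted(set(line.strip() for line in f if line.strip()))

with open('weblinks.txt', 'w') as f:
    f.write('\n'.join(links) + '\n')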
Or using Python. For example:
import urllib
import re

def do_page(url):
    # Fetch the page (Python 2 urllib; Python 3 moved this to urllib.request)
    f = urllib.urlopen(url)
    html = f.read()
    # Match quoted links that start with the site URL and end in .html;
    # the non-greedy .*? keeps links on the same HTML line from being merged
    pattern = r"'{}.*?\.html'".format(url)
    hits = re.findall(pattern, html)
    return hits

if __name__ == '__main__':
    hits = []
    url = 'http://thehackernews.com/'
    hits.extend(do_page(url))
    with open('links.txt', 'wb') as f1:
        for hit in hits:
            f1.write(hit + '\n')  # one link per line
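That snippet is written for Python 2 (urllib.urlopen). If you are on Python 3, a roughly equivalent sketch (my adaptation, not part of the original answer) uses urllib.request and decodes the response before matching:

# Rough Python 3 adaptation of the snippet above (illustrative sketch).
import re
import urllib.request

def do_page(url):
    # Read the page and decode it so the regex can run over text
    html = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
    pattern = r"'{}.*?\.html'".format(re.escape(url))
    return re.findall(pattern, html)

if __name__ == '__main__':
    url = 'http://thehackernews.com/'
    with open('links.txt', 'w') as f1:
        for hit in do_page(url):
            f1.write(hit + '\n')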
Output:
'http://thehackernews.com/2015/10/adblock-extension.html'
'http://thehackernews.com/p/authors.html'
'http://thehackernews.com/2015/10/adblock-extension.html'
'http://thehackernews.com/2015/10/adblock-extension.html'
'http://thehackernews.com/2015/10/adblock-extension.html'
'http://thehackernews.com/2015/10/adblock-extension.html'
'http://thehackernews.com/2015/10/adblock-extension.html'
'http://thehackernews.com/2015/10/adblock-extension.html'
'http://thehackernews.com/2015/10/data-breach-hacking.html'
'http://thehackernews.com/p/authors.html'
'http://thehackernews.com/2015/10/data-breach-hacking.html'
'http://thehackernews.com/2015/10/data-breach-hacking.html'
'http://thehackernews.com/2015/10/data-breach-hacking.html'
'http://thehackernews.com/2015/10/data-breach-hacking.html'
'http://thehackernews.com/2015/10/data-breach-hacking.html'
'http://thehackernews.com/2015/10/data-breach-hacking.html'
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
'http://thehackernews.com/p/authors.html'
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
'http://thehackernews.com/p/authors.html'
'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/p/authors.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/p/authors.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/09/digital-india-facebook.html'
'http://thehackernews.com/2015/09/digital-india-facebook.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/09/winrar-vulnerability.html'
'http://thehackernews.com/2015/09/winrar-vulnerability.html'
'http://thehackernews.com/2015/09/chip-mini-computer.html'
'http://thehackernews.com/2015/09/chip-mini-computer.html'
'http://thehackernews.com/2015/09/edward-snowden-twitter.html'
'http://thehackernews.com/2015/09/edward-snowden-twitter.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/09/quantum-teleportation-data.html'
'http://thehackernews.com/2015/09/quantum-teleportation-data.html'
'http://thehackernews.com/2015/09/iOS-lockscreen-hack.html'
'http://thehackernews.com/2015/09/iOS-lockscreen-hack.html'
'http://thehackernews.com/2015/09/xor-ddos-attack.html'
'http://thehackernews.com/2015/09/xor-ddos-attack.html'
'http://thehackernews.com/2015/09/truecrypt-encryption-software.html'
'http://thehackernews.com/2015/09/truecrypt-encryption-software.html'