Python: listing all the URLs of a website without an index

Question

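I can access each of the following URLs individually: http://www.example.com/{.*}.html

However, access to the home page http://www.example.com is restricted in some way, and I am redirected to an error page that says: Erreur 403 - Refus de traitement de la requête (Interdit - Forbidden).

Is there any way to list all the URLs of the HTML pages hosted under this domain?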

Answer 1

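The short answer is no. You cannot simply list all the HTML pages on a domain the way you would list a directory. Assuming the site's robots.txt allows it, your best bet is to crawl the site with a web-scraping framework such as Scrapy (http://scrapy.org/).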
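To make the robots.txt caveat concrete, here is a minimal sketch using the standard library's urllib.robotparser. The helper name is_allowed and the sample rules are assumptions for illustration only; they are not the real site's robots.txt. The rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url.

    is_allowed is a hypothetical helper name, not part of any library.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no HTTP request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration; a real site publishes its own at /robots.txt.
SAMPLE_RULES = "User-agent: *\nDisallow: /private/"

if __name__ == "__main__":
    print(is_allowed(SAMPLE_RULES, "http://www.example.com/page.html"))
    print(is_allowed(SAMPLE_RULES, "http://www.example.com/private/page.html"))
```

Against a live site you would normally call set_url() and read() on the parser instead of passing the text in. Scrapy can also enforce this for you through its ROBOTSTXT_OBEY setting.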

Answer 2

Thanks to Brian: I managed to crawl the site, starting from a list of accessible HTML pages hosted under the domain.

# scrap.py

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/***.html'  # Accessible URL
    ]

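    rules = (
        # Follow every link whose URL matches \.html and report each page visited.
        Rule(LinkExtractor(allow=(r'\.html', )), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Write the URL to stdout so it can be redirected to a file.
        print(response.url)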

Then:

$ scrapy runspider scrap.py > urls.out
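For readers who want to see roughly what the link-following rule does on a single page, here is a standard-library sketch. It is not Scrapy's implementation: extract_html_links and _AnchorCollector are hypothetical names, and the domain check is only an approximation of what allowed_domains does.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _AnchorCollector(HTMLParser):
    """Collects the raw href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_html_links(html, base_url, allowed_domain="example.com"):
    """Return absolute, de-duplicated, same-site links whose URL matches \\.html."""
    collector = _AnchorCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        # Resolve relative links against the URL of the page they came from.
        absolute = urljoin(base_url, href)
        host = urlparse(absolute).hostname or ""
        same_site = host == allowed_domain or host.endswith("." + allowed_domain)
        # Same pattern as the spider's rule: \.html anywhere in the URL.
        if same_site and re.search(r"\.html", absolute) and absolute not in links:
            links.append(absolute)
    return links


if __name__ == "__main__":
    page = '<a href="a.html">A</a> <a href="http://other.org/c.html">C</a>'
    print(extract_html_links(page, "http://www.example.com/dir/start.html"))
```

A crawler repeats this step on every page it discovers, which is why the result can only contain pages reachable by links from the start URLs.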

