How do I make this Web Crawler infinite?

Question

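This is the code I am trying to write: a web crawler that loops over a list of links. The first link is the starting URL; links found on each page are appended to the list, and the for loop keeps iterating over the growing list. For some reason the script stops after appending and printing about 150 links.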

import requests
from bs4 import BeautifulSoup
import urllib.request

links = ['http://example.com']
def spider(max_pages):
    page = 1
    number = 1
    while page <= max_pages:
        try:
            for LINK in links:
                url = LINK
                source_code = requests.get(url)
                plain_text = source_code.text
                soup = BeautifulSoup(plain_text, "html.parser")
                for link in soup.findAll("a"):
                    try:
                        href = link.get("href")
                        if href.startswith("http"):
                            if href not in links:
                                number += 1
                                links.append(href)
                                print("{}: {}".format(number, href))
                    except:
                        pass

        except Exception as e:
            print(e)

while True:
    spider(10000)

What should I do to make it run indefinitely?

Answer 1

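The error looks like it occurs when you hit an <a> element that has no href attribute. You should check that the link actually has an href before trying to call startswith on it.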
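
A minimal sketch of that check, assuming the inner loop from the question. The helper name is_http_link is hypothetical and not part of BeautifulSoup; link.get("href") returns None when the attribute is absent, which is why the None test comes first.

```python
def is_http_link(href):
    """Return True only when href is a string that starts with "http".

    Hypothetical helper: href is whatever link.get("href") returned,
    which is None for an <a> element without an href attribute.
    """
    return href is not None and href.startswith("http")


# How it could slot into the question's inner loop (shown as comments
# because `soup`, `links`, and `number` live in the original script):
#
#     for link in soup.find_all("a"):
#         href = link.get("href")
#         if is_http_link(href) and href not in links:
#             links.append(href)
```

With the explicit check in place, the bare try/except around the inner loop is no longer needed to hide the AttributeError.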

Answer 2

Samir Chahine,

Your code is failing because the href variable is None after this line:

href = link.get("href")

So add one more check there:

if href is not None and href.startswith("http"):

Please adapt this logic to your Python code.

Try debugging with print statements, like this:



href = link.get("href")
# Pass href as a separate argument: "href " + None would raise a TypeError
print("href:", href)
if href is not None and href.startswith("http"):
    print("Condition passed 1")
    if href not in links:
        print("Condition passed 2")
        number += 1
        links.append(href)
        print("{}: {}".format(number, href))
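
Putting the None check into a complete loop, here is one possible sketch of a crawler that keeps going until it runs out of new links. It is not the asker's code: the names crawl, fetch_links, and fetch_links_with_requests are hypothetical, the queue and the seen set replace appending to a list while iterating over it, and the 10-second timeout is an assumed value. The timeout is there because requests.get does not time out by default, so a single unresponsive server may be one reason a crawl appears to stop.

```python
from collections import deque


def crawl(start_url, fetch_links, max_pages=None):
    """Breadth-first crawl starting at start_url.

    fetch_links(url) must return an iterable of href values, which may
    include None. With max_pages=None the loop runs until no unvisited
    links remain. Returns the set of all URLs discovered.
    """
    seen = {start_url}           # set membership is O(1), unlike `in list`
    queue = deque([start_url])   # URLs discovered but not yet fetched
    visited = 0
    while queue and (max_pages is None or visited < max_pages):
        url = queue.popleft()
        visited += 1
        try:
            hrefs = fetch_links(url)
        except Exception as exc:
            # One bad page should not end the whole crawl.
            print("skipping {}: {}".format(url, exc))
            continue
        for href in hrefs:
            if href is not None and href.startswith("http") and href not in seen:
                seen.add(href)
                queue.append(href)
                print("{}: {}".format(len(seen), href))
    return seen


def fetch_links_with_requests(url):
    """Fetch one page and return the href of every <a> element.

    Imports are local so that crawl() can be tried out with a fake
    fetch_links and without the network libraries installed.
    """
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=10)  # assumed timeout value
    soup = BeautifulSoup(response.text, "html.parser")
    return [a.get("href") for a in soup.find_all("a")]
```

Under these assumptions, crawl("http://example.com", fetch_links_with_requests) would run until the queue is empty, which on the open web is effectively indefinitely; passing max_pages gives a bounded run for testing.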


python

beautifulsoup

web-crawler

web-scraping

python-requests