Flexible Web Crawler

Question

I'm currently stuck with my web crawler. The code so far is:

import requests
from bs4 import BeautifulSoup

def search_spider(max_pages):
    page = 1
    while page <= max_pages:
        # Build the URL for the current results page
        url = 'https://www.thenewboston.com/search.php?type=1&sort=pop&page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # Print the profile link of every user listed on the page
        for link in soup.find_all('a', {'class': 'user-name'}):
            href = "https://www.thenewboston.com/" + link.get('href')
            print(href)
        page += 1  # advance to the next page so the loop terminates

search_spider(1)

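This is the example from a YouTube tutorial. Does anyone know how I have to change the code when the URLs don't end in 1, 2, 3 and so on, but in assorted codes such as 021587, 0874519, NI875121? The site's domain is always the same, but the ending is not as straightforward as in this example. So what I need to know is how to replace str(page) with a variable that takes the URL-ending codes either from a .txt file on my computer (a few hundred of them) or from a list I copy and paste them into. Of course, Python should stop when it reaches the end of the list.

With my current knowledge of Python, I don't know how to solve this at the moment. If you need more information, please let me know. Thanks for your replies!

Flo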

Answer 1

Well, if you have a list of pages to visit rather than a range of numbers, you can do this:

pages = ['021587', '0874519', 'NI875121']

for page in pages:
    url = 'http://example.com/some-path/' + str(page)

To read them in from a file:

with open('filename.txt') as f:
    contents = f.read()

Assuming your pages are separated by whitespace, you can then run

pages = contents.split()

See the documentation for str.split().
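Putting the two pieces together, here is a minimal sketch of how the codes could be loaded from a file and turned into URLs. The helper names `load_page_codes` and `build_urls` are made up for illustration, and the base URL and file name are the placeholders used above, not the real site. The codes are kept as strings, since converting something like 021587 to an integer would drop the leading zero.

```python
def load_page_codes(path):
    """Return the whitespace-separated codes stored in the text file at `path`."""
    with open(path, encoding="utf-8") as f:
        # split() with no arguments splits on any run of spaces, tabs, or newlines
        return f.read().split()


def build_urls(base_url, codes):
    """Append each code to `base_url`; codes stay strings so leading zeros survive."""
    return [base_url + code for code in codes]
```

Something like `for url in build_urls('http://example.com/some-path/', load_page_codes('filename.txt')):` could then replace the `while page <= max_pages` loop in the question, with the request and parsing steps in the loop body unchanged. A `for` loop stops on its own once the last code has been processed, so no page counter is needed.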

python

variables

web-crawler