Unable to boost the performance while parsing links from landing pages

I am trying to implement multiprocessing within the following script using concurrent.futures. The problem is that even when I use concurrent.futures, the performance is still the same. It does not seem to have any effect on the execution process, meaning it fails to boost the performance.

I know I could make concurrent.futures work if I created another function and passed the links populated by get_titles() to it, so that the titles get scraped from their inner pages. However, I wish to get the titles from the landing pages using the function I have created below.

I used an iterative approach instead of recursion only because, had I chosen the latter, the function would throw a recursion error once it was called more than 1000 times (a sketch of that recursive variant follows the script below).

This is what I have tried so far (the site link used within the script is a placeholder):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures

base = 'https://whosebug.com'
link = 'https://whosebug.com/questions/tagged/web-scraping'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    while True:
        res = requests.get(link,headers=headers)
        soup = BeautifulSoup(res.text,"html.parser")
        for item in soup.select(".summary > h3"):
            post_title = item.select_one("a.question-hyperlink").get("href")
            print(urljoin(base,post_title))

        next_page = soup.select_one(".pager > a[rel='next']")

        if not next_page: return
        link = urljoin(base,next_page.get("href"))

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_titles,url): url for url in [link]}
        futures.as_completed(future_to_url)
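
For comparison, the recursive variant I avoided would look roughly like the sketch below (get_titles_recursive is just an illustrative name; it reuses base and headers from the script above). Each paginated page adds another stack frame, which is why a deep enough crawl trips Python's default recursion limit of 1000:

def get_titles_recursive(link):
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base, post_title))

    # one extra stack frame per page, so RecursionError after roughly 1000 pages
    next_page = soup.select_one(".pager > a[rel='next']")
    if next_page:
        get_titles_recursive(urljoin(base, next_page.get("href")))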

Question:

How can I improve the performance while parsing links from landing pages?

Edit: I know I can achieve the same goal by going down the route below, but that is not what my initial attempt looks like:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures

base = 'https://whosebug.com'
links = ['https://whosebug.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(i) for i in range(1,5)]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    res = requests.get(link,headers=headers)
    soup = BeautifulSoup(res.text,"html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base,post_title))

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_titles,url): url for url in links}
        futures.as_completed(future_to_url)

Since your scraper is already using threads, why not "spawn" more workers to process the follow-up URLs coming from the landing pages?

For example:

import concurrent.futures as futures
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base = "https://whosebug.com"
links = [
    f"{base}/questions/tagged/web-scraping?tab=newest&page={i}&pagesize=30"
    for i in range(1, 5)
]

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36",
}


def threader(function, target, workers=5):
    # Fan the items out to a pool of worker threads; the executor's context
    # manager blocks on exit until every submitted job has finished.
    with futures.ThreadPoolExecutor(max_workers=workers) as executor:
        jobs = {executor.submit(function, item): item for item in target}
        futures.as_completed(jobs)


def make_soup(page_url: str) -> BeautifulSoup:
    return BeautifulSoup(requests.get(page_url, headers=headers).text, "html.parser")


def process_page(page: str):
    s = make_soup(page).find("div", class_="grid--cell ws-nowrap mb8")
    views = s.getText() if s is not None else "Missing data"
    print(f"{page}\n{' '.join(views.split())}")


def make_pages(soup_of_pages: BeautifulSoup) -> list:
    return [
        urljoin(base, item.select_one("a.question-hyperlink").get("href"))
        for item in soup_of_pages.select(".summary > h3")
    ]


def crawler(link):
    while True:
        soup = make_soup(link)
        threader(process_page, make_pages(soup), workers=10)
        next_page = soup.select_one(".pager > a[rel='next']")
        if not next_page:
            return
        link = urljoin(base, next_page.get("href"))


if __name__ == '__main__':
    threader(crawler, links)

Sample run output:


Viewed 19 times

Viewed 32 times

Viewed 22 times

and more ...

Rationale:

Essentially, what you do in your initial approach is spawn workers that fetch question URLs from the search pages. You never process the follow-up URLs.

My suggestion is to spawn additional workers to process what the crawling workers gather.

In your question you mention:

I wish to get the titles from landing pages

and that is what the tweaked version of your initial approach tries to accomplish with the threader() function, which is essentially a wrapper around ThreadPoolExecutor().

Python does not handle concurrency particularly well. You can work around that by having the Python script process a single link and letting Bash provide the concurrency. Here is an example.

The Python code, let's call it crawlLink.py:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import sys

base = 'https://whosebug.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    res = requests.get(link,headers=headers)
    soup = BeautifulSoup(res.text,"html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base,post_title))

link = sys.argv[1]
get_titles(link)

The Bash script:

#! /bin/bash

links=""

for page in {1..5}
do
    links="${links} https://whosebug.com/questions/tagged/web-scraping?tab=newest&page=${page}&pagesize=30"
done

echo "${links}" | xargs  -i --max-procs=5 bash -c '/usr/bin/python3 crawlLink.py "{}"'
I hope you are not actually spawning a single worker thread, as in your example :)
future_to_url = {executor.submit(get_titles,url): url for url in [link]}
  1. You will speed things up a lot by simply using a Session instead of plain requests.get() calls: it re-uses the underlying connection, effectively declaring that you may query the site again until the session object is explicitly closed or garbage collected (see the sketch after this list).
  2. As a matter of fact, because of the GIL, Python threads are not good at CPU-bound tasks such as parsing HTML, so you should probably use a ProcessPoolExecutor for the parsing and leave the ThreadPoolExecutor (or even a single thread) to handle only the HTTP requests (again, see the sketch after this list).
  3. Above all, I would strongly suggest taking a look at aiohttp as a non-blocking successor to requests (to urllib really, but never mind). It is built on top of asyncio, so you can forget about thread-safety issues, implicit locks and so on:
    In [3]: import aiohttp, asyncio, time
       ...:
       ...:
       ...: async def do_the_request(session):
       ...:     async with session.get("http://www.google.com") as resp:
       ...:         content = await resp.read()
       ...:
       ...:
       ...: async def main():
       ...:     async with aiohttp.ClientSession() as session:
       ...:         tasks = {asyncio.create_task(do_the_request(session)) for _ in range(100)}
       ...:         await asyncio.wait(tasks, return_when=asyncio.ALL_COMPLETED)
       ...:
       ...:
       ...: t0 = time.monotonic()
       ...: asyncio.run(main())
       ...: t1 = time.monotonic()
       ...: print(f"Time elapsed: {t1 - t0:.3f}")
       Time elapsed: 0.571

  4. Finally, if you ever hit the limit of your CPU cores while running your process (I bet you will run out of bandwidth well before that), just start another process.
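
To make points 1 and 2 concrete, here is a minimal sketch (fetch, parse_titles and page_urls are just illustrative names) that reuses the selectors and page URLs from the question: a shared requests.Session fetches the pages in a thread pool, and the HTML parsing is handed off to a process pool so the GIL does not get in the way. Note that requests does not formally guarantee that a Session is thread-safe, although sharing one for simple GET requests like these is common practice.

import concurrent.futures as futures
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base = "https://whosebug.com"
page_urls = [
    f"{base}/questions/tagged/web-scraping?tab=newest&page={i}&pagesize=30"
    for i in range(1, 5)
]

session = requests.Session()  # connection re-use across all requests
session.headers["User-Agent"] = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36"
)


def fetch(url: str) -> str:
    # I/O-bound work: threads are fine here
    return session.get(url).text


def parse_titles(html: str) -> list:
    # CPU-bound work: runs in a separate process, out of the GIL's reach
    soup = BeautifulSoup(html, "html.parser")
    return [
        urljoin(base, item.select_one("a.question-hyperlink").get("href"))
        for item in soup.select(".summary > h3")
    ]


if __name__ == "__main__":
    with futures.ThreadPoolExecutor(max_workers=5) as io_pool, \
         futures.ProcessPoolExecutor() as cpu_pool:
        pages = io_pool.map(fetch, page_urls)             # download concurrently
        for titles in cpu_pool.map(parse_titles, pages):  # parse in worker processes
            for url in titles:
                print(url)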