Unable to boost the performance while parsing links from landing pages

I am trying to implement multiprocessing within the following script using concurrent.futures. The problem is that even when I use concurrent.futures, the performance is still the same. It does not seem to have any effect on the execution process, meaning it fails to boost the performance.

I know I could make concurrent.futures work if I created another function and passed the links populated by get_titles() to it, so that the titles get scraped from their inner pages. However, I wish to get the titles from the landing pages using the function I have created below.

I used an iterative approach instead of recursion only because, had I chosen the latter, the function would throw a recursion error once it was called more than 1000 times (a sketch of that recursive variant follows the script below).

This is what I have tried so far (the site link used within the script is a placeholder):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures

base = 'https://whosebug.com'
link = 'https://whosebug.com/questions/tagged/web-scraping'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    while True:
        res = requests.get(link,headers=headers)
        soup = BeautifulSoup(res.text,"html.parser")
        for item in soup.select(".summary > h3"):
            post_title = item.select_one("a.question-hyperlink").get("href")
            print(urljoin(base,post_title))

        next_page = soup.select_one(".pager > a[rel='next']")

        if not next_page: return
        link = urljoin(base,next_page.get("href"))

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_titles,url): url for url in [link]}
        futures.as_completed(future_to_url)
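
For comparison, the recursive variant I avoided would look roughly like the sketch below (get_titles_recursive is just an illustrative name; it reuses base and headers from the script above). Each paginated page adds another stack frame, which is why a deep enough crawl trips Python's default recursion limit of 1000:

def get_titles_recursive(link):
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base, post_title))

    # one extra stack frame per page, so RecursionError after roughly 1000 pages
    next_page = soup.select_one(".pager > a[rel='next']")
    if next_page:
        get_titles_recursive(urljoin(base, next_page.get("href")))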

Question:

How can I improve the performance while parsing links from landing pages?

Edit: I know I can achieve the same goal by going down the route below, but that is not what my initial attempt looks like:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures

base = 'https://whosebug.com'
links = ['https://whosebug.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(i) for i in range(1,5)]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    res = requests.get(link,headers=headers)
    soup = BeautifulSoup(res.text,"html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base,post_title))

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_titles,url): url for url in links}
        futures.as_completed(future_to_url)

Since your scraper is already using threads, why not "spawn" more workers to process the follow-up URLs coming from the landing pages?

For example:

import concurrent.futures as futures
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base = "https://whosebug.com"
links = [
    f"{base}/questions/tagged/web-scraping?tab=newest&page={i}&pagesize=30"
    for i in range(1, 5)
]

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36",
}


def threader(function, target, workers=5):
    # Fan the items out to a pool of worker threads; the executor's context
    # manager blocks on exit until every submitted job has finished.
    with futures.ThreadPoolExecutor(max_workers=workers) as executor:
        jobs = {executor.submit(function, item): item for item in target}
        futures.as_completed(jobs)


def make_soup(page_url: str) -> BeautifulSoup:
    return BeautifulSoup(requests.get(page_url, headers=headers).text, "html.parser")


def process_page(page: str):
    s = make_soup(page).find("div", class_="grid--cell ws-nowrap mb8")
    views = s.getText() if s is not None else "Missing data"
    print(f"{page}\n{' '.join(views.split())}")


def make_pages(soup_of_pages: BeautifulSoup) -> list:
    return [
        urljoin(base, item.select_one("a.question-hyperlink").get("href"))
        for item in soup_of_pages.select(".summary > h3")
    ]


def crawler(link):
    while True:
        soup = make_soup(link)
        threader(process_page, make_pages(soup), workers=10)
        next_page = soup.select_one(".pager > a[rel='next']")
        if not next_page:
            return
        link = urljoin(base, next_page.get("href"))


if __name__ == '__main__':
    threader(crawler, links)

Sample run output:


Viewed 19 times

Viewed 32 times

Viewed 22 times

and more ...

Rationale:

Essentially, what you do in your initial approach is spawn workers that fetch question URLs from the search pages. You never process the follow-up URLs.

My suggestion is to spawn additional workers to process what the crawling workers gather.

In your question you mention:

I wish to get the titles from landing pages

and that is what the tweaked version of your initial approach tries to accomplish with the threader() function, which is essentially a wrapper around ThreadPoolExecutor().

Python does not handle concurrency particularly well. You can work around that by having the Python script process a single link and letting Bash provide the concurrency. Here is an example.

The Python code, let's call it crawlLink.py:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import sys

base = 'https://whosebug.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    res = requests.get(link,headers=headers)
    soup = BeautifulSoup(res.text,"html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base,post_title))

link = sys.argv[1]
get_titles(link)

The Bash script:

#! /bin/bash

links=""

for page in {1..5}
do
    links="${links} https://whosebug.com/questions/tagged/web-scraping?tab=newest&page=${page}&pagesize=30"
done

echo "${links}" | xargs  -i --max-procs=5 bash -c '/usr/bin/python3 crawlLink.py "{}"'
I hope you are not actually spawning a single worker thread, as in your example :)
future_to_url = {executor.submit(get_titles,url): url for url in [link]}
  1. You will speed things up a lot by simply using a Session instead of plain requests.get() calls: it re-uses the underlying connection, effectively declaring that you may query the site again until the session object is explicitly closed or garbage collected (see the sketch after this list).
  2. As a matter of fact, because of the GIL, Python threads are not good at CPU-bound tasks such as parsing HTML, so you should probably use a ProcessPoolExecutor for the parsing and leave the ThreadPoolExecutor (or even a single thread) to handle only the HTTP requests (again, see the sketch after this list).
  3. Above all, I would strongly suggest taking a look at aiohttp as a non-blocking successor to requests (to urllib really, but never mind). It is built on top of asyncio, so you can forget about thread-safety issues, implicit locks and so on:
    In [3]: import aiohttp, asyncio, time
       ...:
       ...:
       ...: async def do_the_request(session):
       ...:     async with session.get("http://www.google.com") as resp:
       ...:         content = await resp.read()
       ...:
       ...:
       ...: async def main():
       ...:     async with aiohttp.ClientSession() as session:
       ...:         tasks = {asyncio.create_task(do_the_request(session)) for _ in range(100)}
       ...:         await asyncio.wait(tasks, return_when=asyncio.ALL_COMPLETED)
       ...:
       ...:
       ...: t0 = time.monotonic()
       ...: asyncio.run(main())
       ...: t1 = time.monotonic()
       ...: print(f"Time elapsed: {t1 - t0:.3f}")
       Time elapsed: 0.571

  4. Finally, if you ever hit the limit of your CPU cores while running your process (I bet you will run out of bandwidth well before that), just start another process.
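
To make points 1 and 2 concrete, here is a minimal sketch (fetch, parse_titles and page_urls are just illustrative names) that reuses the selectors and page URLs from the question: a shared requests.Session fetches the pages in a thread pool, and the HTML parsing is handed off to a process pool so the GIL does not get in the way. Note that requests does not formally guarantee that a Session is thread-safe, although sharing one for simple GET requests like these is common practice.

import concurrent.futures as futures
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base = "https://whosebug.com"
page_urls = [
    f"{base}/questions/tagged/web-scraping?tab=newest&page={i}&pagesize=30"
    for i in range(1, 5)
]

session = requests.Session()  # connection re-use across all requests
session.headers["User-Agent"] = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36"
)


def fetch(url: str) -> str:
    # I/O-bound work: threads are fine here
    return session.get(url).text


def parse_titles(html: str) -> list:
    # CPU-bound work: runs in a separate process, out of the GIL's reach
    soup = BeautifulSoup(html, "html.parser")
    return [
        urljoin(base, item.select_one("a.question-hyperlink").get("href"))
        for item in soup.select(".summary > h3")
    ]


if __name__ == "__main__":
    with futures.ThreadPoolExecutor(max_workers=5) as io_pool, \
         futures.ProcessPoolExecutor() as cpu_pool:
        pages = io_pool.map(fetch, page_urls)             # download concurrently
        for titles in cpu_pool.map(parse_titles, pages):  # parse in worker processes
            for url in titles:
                print(url)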