Script performs very slowly even when it runs asynchronously

Question

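I've written a script using asyncio together with the aiohttp library to parse the content of a website asynchronously. I've tried to apply the logic in the following script the way it is usually applied in scrapy.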

However, when I execute my script, it behaves the way synchronous libraries like requests or urllib.request do. So it is very slow and doesn't serve the purpose.

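I know I can get around this by defining all the next-page links up front in the link variable. But am I not already doing the task the right way with my existing script?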

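In the script, the processing_docs() function collects the links of the different posts and passes the refined links to the fetch_again() function, which fetches the title from each target page. processing_docs() also contains logic that collects the next_page link and supplies it to the fetch() function to repeat the same steps. This next_page call is making the script slower, whereas we usually do the same in scrapy and get the expected performance.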

My question is: how can I achieve the same goal while keeping the existing logic intact?

import aiohttp
import asyncio
from lxml.html import fromstring
from urllib.parse import urljoin

link = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            text = await response.text()
            result = await processing_docs(session, text)
        return result

async def processing_docs(session, html):
        tree = fromstring(html)
        titles = [urljoin(link,title.attrib['href']) for title in tree.cssselect(".summary .question-hyperlink")]
        for title in titles:
            await fetch_again(session,title)

        next_page = tree.cssselect("div.pager a[rel='next']")
        if next_page:
            page_link = urljoin(link,next_page[0].attrib['href'])
            await fetch(page_link)

async def fetch_again(session,url):
    async with session.get(url) as response:
        text = await response.text()
        tree = fromstring(text)
        title = tree.cssselect("h1[itemprop='name'] a")[0].text
        print(title)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(fetch(url) for url in [link])))
    loop.close()

Answer 1

The whole point of using asyncio is that you can run multiple fetches concurrently (in parallel with each other). Let's look at your code:

for title in titles:
    await fetch_again(session,title)

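This part means that each new fetch_again starts only after the previous one has been awaited (finished). If you do things this way then, yes, there is no difference from using a synchronous approach.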

To get the full benefit of asyncio, start multiple fetches concurrently using asyncio.gather:

await asyncio.gather(*[
    fetch_again(session,title)
    for title 
    in titles
])

You'll see a significant speedup.
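To make the difference concrete without touching the network, here is a minimal sketch that replaces the HTTP request with asyncio.sleep. The names fake_fetch, run_sequential, run_concurrent and timed are hypothetical stand-ins (fake_fetch plays the role of fetch_again), and the 0.1 second delay is an arbitrary assumption.

```python
import asyncio
import time

DELAY = 0.1  # assumed latency of one simulated request, in seconds


async def fake_fetch(url):
    # Hypothetical stand-in for fetch_again(): sleeps instead of doing real I/O.
    await asyncio.sleep(DELAY)
    return url


async def run_sequential(urls):
    # Each await must finish before the next simulated request starts.
    return [await fake_fetch(url) for url in urls]


async def run_concurrent(urls):
    # All simulated requests are started together; gather keeps input order.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed(coro_func, urls):
    # Run one of the coroutines above and measure wall-clock time.
    start = time.perf_counter()
    result = asyncio.run(coro_func(urls))
    return result, time.perf_counter() - start


urls = [f"page-{i}" for i in range(10)]
seq_result, seq_time = timed(run_sequential, urls)
con_result, con_time = timed(run_concurrent, urls)
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

With ten simulated requests, the sequential version should take roughly ten times the delay, while the concurrent version should take roughly one delay.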

You can go further and start the fetch for the next page concurrently with the fetch_again calls for the titles:

async def processing_docs(session, html):
        coros = []

        tree = fromstring(html)

        # titles:
        titles = [
            urljoin(link,title.attrib['href']) 
            for title 
            in tree.cssselect(".summary .question-hyperlink")
        ]

        for title in titles:
            coros.append(
                fetch_again(session,title)
            )

        # next_page:
        next_page = tree.cssselect("div.pager a[rel='next']")
        if next_page:
            page_link = urljoin(link,next_page[0].attrib['href'])

            coros.append(
                fetch(page_link)
            )

        # await:
        await asyncio.gather(*coros)
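The same pattern can be tried offline. The following is only a sketch under stated assumptions: PAGES is a made-up in-memory "site", and fake_fetch_post and crawl are hypothetical stand-ins for fetch_again and the fetch/processing_docs pair. It shows that the next listing page is scheduled alongside the posts of the current page instead of after them.

```python
import asyncio

# Hypothetical in-memory "site": each listing page has posts and an optional next page.
PAGES = {
    "page1": {"posts": ["a", "b"], "next": "page2"},
    "page2": {"posts": ["c"], "next": "page3"},
    "page3": {"posts": ["d", "e"], "next": None},
}

seen = []  # posts that have been "fetched"


async def fake_fetch_post(url):
    # Stand-in for fetch_again(): simulated network delay, then record the post.
    await asyncio.sleep(0.01)
    seen.append(url)


async def crawl(page):
    # Stand-in for fetch() + processing_docs() on one listing page.
    await asyncio.sleep(0.01)  # simulated download of the listing page
    data = PAGES[page]
    coros = [fake_fetch_post(url) for url in data["posts"]]
    if data["next"]:
        # The next listing page runs concurrently with this page's posts.
        coros.append(crawl(data["next"]))
    await asyncio.gather(*coros)


asyncio.run(crawl("page1"))
print(sorted(seen))
```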

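Important note

While this approach lets you do things much faster, you may want to limit the number of concurrent requests at any one time to avoid heavy resource usage on both your machine and the server.

You can use asyncio.Semaphore for this purpose: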

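semaphore = asyncio.Semaphore(10)

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        # Hold the semaphore only while the request is in flight. Holding it
        # across processing_docs() would let nested next-page fetches use up
        # every permit while each waits on the page after it.
        async with semaphore:
            async with session.get(url) as response:
                text = await response.text()
        result = await processing_docs(session, text)
        return result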
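As a rough check that a semaphore really caps concurrency, here is a minimal offline sketch. The names limited_fetch, stats and LIMIT are hypothetical, and the request is again simulated with asyncio.sleep; the semaphore is held only around the simulated request.

```python
import asyncio

LIMIT = 3  # assumed cap on simultaneous requests


async def limited_fetch(url, semaphore, stats):
    # Only LIMIT coroutines can be inside this block at the same time.
    async with semaphore:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # simulated request
        stats["in_flight"] -= 1
    return url


async def main():
    # The semaphore is created inside the running event loop.
    semaphore = asyncio.Semaphore(LIMIT)
    stats = {"in_flight": 0, "peak": 0}
    urls = [f"page-{i}" for i in range(20)]
    results = await asyncio.gather(
        *(limited_fetch(url, semaphore, stats) for url in urls)
    )
    return results, stats["peak"]


results, peak = asyncio.run(main())
print(f"fetched {len(results)} pages, peak concurrency {peak}")
```

All twenty simulated requests complete, but the recorded peak never exceeds LIMIT.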


python

web-scraping

python-3.x

python-asyncio

aiohttp