tornado: AsyncHTTPClient.fetch from an iterator?

Question

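I'm trying to write a web crawler and want to make HTTP requests as quickly as possible. Tornado's AsyncHTTPClient seems like a good choice, but all the example code I've seen basically calls AsyncHTTPClient.fetch on a huge list of URLs, letting Tornado queue them up and eventually make the requests.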

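But what if I want to process an indefinitely long (or just very large) list of URLs from a file or the network? I don't want to load all the URLs into memory.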

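I googled around but can't seem to find a way to call AsyncHTTPClient.fetch from an iterator. I did, however, find a way to do what I want using gevent: http://gevent.org/gevent.threadpool.html#gevent.threadpool.ThreadPool.imap. Is there a way to do something similar in Tornado?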

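One solution I've thought of is to queue up only a limited number of URLs initially, then add logic to queue up more whenever a fetch operation completes, but I'm hoping there's a cleaner solution.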

Any help or recommendations would be appreciated!

Answer 1

I would do this with a Queue and multiple workers, in a variation on https://github.com/tornadoweb/tornado/blob/master/demos/webspider/webspider.py:

import tornado.queues
from tornado import gen
from tornado.httpclient import AsyncHTTPClient
from tornado.ioloop import IOLoop

NUM_WORKERS = 10
QUEUE_SIZE = 100
q = tornado.queues.Queue(QUEUE_SIZE)
AsyncHTTPClient.configure(None, max_clients=NUM_WORKERS)
http_client = AsyncHTTPClient()

@gen.coroutine
def worker():
    while True:
        url = yield q.get()
        try:
            response = yield http_client.fetch(url)
            print('got response from', url)
        except Exception:
            print('failed to fetch', url)
        finally:
            q.task_done()

@gen.coroutine
def main():
    for i in range(NUM_WORKERS):
        IOLoop.current().spawn_callback(worker)
    with open("urls.txt") as f:
        for line in f:
            url = line.strip()
            # When the queue fills up, stop here to wait instead
            # of reading more from the file.
            yield q.put(url)
    yield q.join()

if __name__ == '__main__':
    IOLoop.current().run_sync(main)
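
The same bounded-queue idea is not tied to a file: any iterator or generator can feed the workers, because the producer loop only pulls the next URL once `put` has room. Below is a minimal sketch of that generalization, not the code from the answer. To keep it runnable without Tornado installed it swaps in the standard library's `asyncio.Queue` and native coroutines in place of `tornado.queues.Queue` and `@gen.coroutine`. The names `fetch_from_iterator`, `fake_fetch` and `on_result` are hypothetical, made up for this illustration; `fake_fetch` only sleeps and stands in for a real `AsyncHTTPClient().fetch` call.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request such as
    # `await AsyncHTTPClient().fetch(url)`; it only sleeps briefly.
    await asyncio.sleep(0.01)
    return "response for " + url


async def fetch_from_iterator(urls, fetch, on_result,
                              num_workers=10, queue_size=100):
    """Run `fetch` over any iterable of URLs without loading it all.

    `on_result(url, response, error)` is called once per URL; exactly
    one of `response` and `error` is None.
    """
    q = asyncio.Queue(maxsize=queue_size)

    async def worker():
        while True:
            url = await q.get()
            try:
                response = await fetch(url)
            except Exception as exc:
                on_result(url, None, exc)
            else:
                on_result(url, response, None)
            finally:
                q.task_done()

    workers = [asyncio.ensure_future(worker()) for _ in range(num_workers)]
    try:
        for url in urls:
            # When the queue is full this waits, so the iterator is not
            # advanced until a worker has taken an item off the queue.
            await q.put(url)
        # Wait until every queued URL has been processed.
        await q.join()
    finally:
        # The workers loop forever, so stop them explicitly.
        for w in workers:
            w.cancel()
        await asyncio.gather(*workers, return_exceptions=True)


if __name__ == "__main__":
    def print_result(url, response, error):
        print("failed to fetch" if error else "got response from", url)

    # A generator: URLs are produced lazily, never held as a full list.
    lazy_urls = ("http://example.invalid/page/%d" % i for i in range(20))
    asyncio.run(fetch_from_iterator(lazy_urls, fake_fetch, print_result,
                                    num_workers=4, queue_size=8))
```

At any moment the number of URLs pulled from the iterator but not yet finished is at most `queue_size + num_workers + 1`, so memory use stays bounded however long the input is. With Tornado 5 or later, where the IOLoop runs on top of asyncio, passing a small coroutine that wraps `AsyncHTTPClient().fetch` as `fetch` should work in the same way, though that combination is not exercised here.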

python

asynchronous

tornado