Concurrent futures webscraping
I'm currently trying to write a fast web scraping function so that I can scrape a large number of pages.
This is the code I have so far:
import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ProcessPoolExecutor, as_completed

def parse(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    return soup.find_all('a')

with ProcessPoolExecutor(max_workers=4) as executor:
    start = time.time()
    futures = [ executor.submit(parse, url) for url in URLs ]
    results = []
    for result in as_completed(futures):
        results.append(result)
    end = time.time()
    print("Time Taken: {:.6f}s".format(end-start))
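This brings back results for a website, e.g. www.google.com, but my problem is that I don't know how to view the data it returns. All I get are Future objects.
Could someone please explain or show me how to do this?
Thanks in advance for any help.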
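You can also do it with a dict comprehension that maps each future to its URL, then call result() on each completed future to get the data, as shown below.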
with ProcessPoolExecutor(max_workers=4) as executor:
    start = time.time()
    futures = { executor.submit(parse, url): url for url in URLs }
    for result in as_completed(futures):
        link = futures.get(result)
        try:
            data = result.result()
        except Exception as e:
            print(e)
        else:
            print("Link: {}, data: {}".format(link, data))
    end = time.time()
    print("Time Taken: {:.6f}s".format(end-start))
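To try the future-to-URL pattern without touching the network, here is a minimal sketch under a few assumptions. `fake_parse` and `collect` are hypothetical names, not part of the code above; `fake_parse` stands in for `parse` and returns plain strings. The sketch also swaps in `ThreadPoolExecutor` for `ProcessPoolExecutor`, which is often a reasonable fit for I/O-bound fetching and avoids having to pickle return values between processes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_parse(url):
    """Hypothetical stand-in for parse(): returns plain strings, no network access."""
    if "bad" in url:
        raise ValueError("could not fetch {}".format(url))
    return ["{}/link{}".format(url, i) for i in range(2)]


def collect(urls, worker=fake_parse, max_workers=4):
    """Run worker over urls concurrently; return (results, errors), both keyed by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each Future back to the URL it was submitted with.
        futures = {executor.submit(worker, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                # result() returns the worker's return value,
                # or re-raises the exception the worker raised.
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors


if __name__ == "__main__":
    results, errors = collect(["http://a.example", "http://bad.example"])
    print(results)
    print(errors)
```

Appending the future itself to a list, as in the question, stores the `Future` wrapper; calling `result()` on it is what yields the data. If you stay with `ProcessPoolExecutor`, returning simple values such as `href` strings rather than BeautifulSoup objects may make it easier to pass results back from the worker processes.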