Unable to boost the performance while parsing links from landing pages
I'm trying to implement multiprocessing within the following script using concurrent.futures. The problem is that even when I use concurrent.futures, the performance stays the same. It doesn't seem to have any effect on the execution process, meaning it fails to boost the performance.

I know that if I created another function and passed the links populated from get_titles() to it in order to scrape the titles from their inner pages, I could make concurrent.futures work. However, I wish to get the titles from the landing pages using the function I've created below.

I used an iterative approach instead of recursion only because, if I went for the latter, the function would throw a recursion error once it was called more than 1000 times.

This is how I've tried so far (the site link that I've used within the script is a placeholder):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures
base = 'https://whosebug.com'
link = 'https://whosebug.com/questions/tagged/web-scraping'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    while True:
        res = requests.get(link, headers=headers)
        soup = BeautifulSoup(res.text, "html.parser")
        for item in soup.select(".summary > h3"):
            post_title = item.select_one("a.question-hyperlink").get("href")
            print(urljoin(base, post_title))

        next_page = soup.select_one(".pager > a[rel='next']")
        if not next_page: return
        link = urljoin(base, next_page.get("href"))

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_titles, url): url for url in [link]}
        futures.as_completed(future_to_url)
Question:
How can I improve the performance while parsing links from landing pages?
Edit:

I know I can achieve the same goal along the lines below, but that is not what my initial attempt looks like:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures
base = 'https://whosebug.com'
links = ['https://whosebug.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(i) for i in range(1,5)]
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base, post_title))

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_titles, url): url for url in links}
        futures.as_completed(future_to_url)
Since your scraper is already using threads, why not "spawn" a few more workers to deal with the follow-up URLs from the landing pages?

For example:
import concurrent.futures as futures
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
base = "https://whosebug.com"
links = [
    f"{base}/questions/tagged/web-scraping?tab=newest&page={i}&pagesize=30"
    for i in range(1, 5)
]
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36",
}


def threader(function, target, workers=5):
    # Run the given function over every item in target using a thread pool.
    with futures.ThreadPoolExecutor(max_workers=workers) as executor:
        jobs = {executor.submit(function, item): item for item in target}
        futures.as_completed(jobs)


def make_soup(page_url: str) -> BeautifulSoup:
    return BeautifulSoup(requests.get(page_url, headers=headers).text, "html.parser")


def process_page(page: str):
    # Fetch a single question page and print how many times it has been viewed.
    s = make_soup(page).find("div", class_="grid--cell ws-nowrap mb8")
    views = s.getText() if s is not None else "Missing data"
    print(f"{page}\n{' '.join(views.split())}")


def make_pages(soup_of_pages: BeautifulSoup) -> list:
    # Collect the question URLs from one landing (search) page.
    return [
        urljoin(base, item.select_one("a.question-hyperlink").get("href"))
        for item in soup_of_pages.select(".summary > h3")
    ]


def crawler(link):
    # Walk the paginated landing pages and hand each batch of question URLs
    # to a second pool of workers.
    while True:
        soup = make_soup(link)
        threader(process_page, make_pages(soup), workers=10)
        next_page = soup.select_one(".pager > a[rel='next']")
        if not next_page:
            return
        link = urljoin(base, next_page.get("href"))


if __name__ == '__main__':
    threader(crawler, links)
Sample run output:
Viewed 19 times
Viewed 32 times
Viewed 22 times
and more ...
Rationale:

Essentially, what you're doing in your initial approach is spawning workers to fetch question URLs from the search pages. You never process the URLs that follow from them.

My suggestion is to spawn additional workers to process what the crawling workers gather.

In your question you mention:

I wish to get the titles from landing pages

That is what the tweaked version of your initial approach tries to accomplish by using the threader() function, which is essentially a wrapper around a ThreadPoolExecutor().
Python doesn't handle concurrency particularly well, but you can work around that by having the Python script deal with a single link and using Bash to provide the concurrency. Here's an example.

The Python code, let's call it crawlLink.py:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import sys
base = 'https://whosebug.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base, post_title))

link = sys.argv[1]
get_titles(link)
The Bash script:
#!/bin/bash
# Emit one URL per line so that xargs starts a separate crawlLink.py
# process (at most 5 at a time) for each link.
for page in {1..5}
do
    echo "https://whosebug.com/questions/tagged/web-scraping?tab=newest&page=${page}&pagesize=30"
done | xargs -I{} --max-procs=5 bash -c '/usr/bin/python3 crawlLink.py "{}"'
- I hope you are not actually spawning just a single thread, as shown in your example :)
future_to_url = {executor.submit(get_titles,url): url for url in [link]}
- You will speed things up a lot by simply using a Session (which means reusing the connection, explicitly stating that you may query the site again until the session object is explicitly closed or garbage-collected) instead of plain requests.get() calls.
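For illustration, here is a minimal sketch of the question's get_titles() with a Session swapped in for the plain requests.get() calls; base, headers and selectors are the ones from the question:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://whosebug.com'

# The Session keeps the underlying TCP/TLS connection alive between
# requests to the same host, so repeated page fetches skip the handshake.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
})

def get_titles(link):
    while True:
        res = session.get(link)  # reuses the pooled connection
        soup = BeautifulSoup(res.text, "html.parser")
        for item in soup.select(".summary > h3"):
            post_title = item.select_one("a.question-hyperlink").get("href")
            print(urljoin(base, post_title))
        next_page = soup.select_one(".pager > a[rel='next']")
        if not next_page:
            return
        link = urljoin(base, next_page.get("href"))

Whether one Session is shared across all worker threads or each worker gets its own is a judgment call; a Session per thread is the more conservative choice.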
- In fact, since Python threads are bad at CPU-bound tasks (such as parsing HTML) because of the GIL, you should probably use a ProcessPoolExecutor for the parsing and leave the ThreadPoolExecutor (or even a single thread) to handle only the HTTP requests. A sketch of that split follows below.
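Below is a minimal sketch of that division of labour, assuming the same landing-page URLs, headers and selectors as in the question; fetch() and parse_titles() are hypothetical helper names, not part of the original script. The thread pool downloads the HTML while the process pool does the parsing:

import concurrent.futures as futures
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://whosebug.com'
links = ['https://whosebug.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(i) for i in range(1, 5)]
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def fetch(link):
    # I/O-bound: fine to run in a thread despite the GIL
    return requests.get(link, headers=headers).text

def parse_titles(html):
    # CPU-bound: runs in a separate worker process
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base, item.select_one("a.question-hyperlink").get("href"))
            for item in soup.select(".summary > h3")]

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=5) as tpool, \
            futures.ProcessPoolExecutor(max_workers=4) as ppool:
        pages = tpool.map(fetch, links)                 # download concurrently
        for titles in ppool.map(parse_titles, pages):   # parse in worker processes
            print('\n'.join(titles))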
- After all, I strongly suggest taking a look at aiohttp as a non-blocking successor to requests (well, to urllib really, but never mind). It is built on top of asyncio, so you can forget about thread-safety issues, implicit locks and so on:
In [3]: import aiohttp, asyncio, time
   ...:
   ...: t0 = time.monotonic()
   ...:
   ...:
   ...: async def do_the_request(session):
   ...:     async with session.get("http://www.google.com") as resp:
   ...:         content = await resp.read()
   ...:
   ...:
   ...: async def main():
   ...:     async with aiohttp.ClientSession() as session:
   ...:         tasks = {asyncio.create_task(do_the_request(session)) for _ in range(100)}
   ...:         await asyncio.wait(tasks, return_when=asyncio.ALL_COMPLETED)
   ...:
   ...:
   ...: t0 = time.monotonic()
   ...: asyncio.run(main())
   ...: t1 = time.monotonic()
   ...: print(f"Time elapsed: {t1 - t0:.3f}")
Time elapsed: 0.571
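Applied to the landing pages from the question, a rough sketch of the same idea could look like this (the coroutine layout is an assumption for illustration; the URL, headers and selectors are the ones used above):

import asyncio
import aiohttp
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://whosebug.com'
links = [f'{base}/questions/tagged/web-scraping?tab=newest&page={i}&pagesize=30' for i in range(1, 5)]
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

async def get_titles(session, link):
    # Fetch one landing page without blocking the event loop, then parse it.
    async with session.get(link) as resp:
        html = await resp.text()
    soup = BeautifulSoup(html, "html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base, post_title))

async def main():
    async with aiohttp.ClientSession(headers=headers) as session:
        await asyncio.gather(*(get_titles(session, link) for link in links))

if __name__ == '__main__':
    asyncio.run(main())

The BeautifulSoup parsing is still synchronous here; for a handful of listing pages the network is the bottleneck, but heavier parsing could be pushed out of the event loop via run_in_executor().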
- Finally, if you ever hit the limit of your CPU cores while running your process (I bet you will run out of bandwidth well before that), just launch another process.