My script encounters an error when it is supposed to run asynchronously
I've written a script in Python using asyncio in association with the aiohttp library to parse, asynchronously, the names out of the pop-up boxes that appear upon clicking the contact-info buttons for the agencies listed in the table on this website. The webpage displays its tabular content across 513 pages.
I ran into the error too many file descriptors in select() when I tried with asyncio.get_event_loop(), but I came across a suggestion to use asyncio.ProactorEventLoop() to avoid such an error, so I used the latter. However, even after complying with that suggestion, the script collects names from only a few pages before throwing the following error. How can I fix this?
raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host www.tursab.org.tr:443 ssl:None [The semaphore timeout period has expired]
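(For context, the too many file descriptors in select() error comes from the selector-based default loop, whose select() call is limited to roughly 512 sockets on Windows; the proactor loop does not have that limit. Below is a minimal sketch of the usual way to opt into it, assuming Windows and Python 3.7+ where the policy class exists; on Python 3.8+ the proactor loop is already the Windows default.)

import asyncio
import sys

# A sketch, assuming Windows and Python 3.7+ (where the policy class
# below exists): switch the whole program to the proactor event loop
# via the policy API instead of constructing ProactorEventLoop by hand.
# On Python 3.8+ the proactor loop is already the default on Windows.
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())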
This is my attempt so far:
import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1,514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

async def get_links(url):
    async with asyncio.Semaphore(10):
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
                result = await process_docs(text)
            return result

async def process_docs(html):
    coros = []
    soup = BeautifulSoup(html,"lxml")
    items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
    for item in items:
        coros.append(fetch_again(lead_link.format(item)))
    await asyncio.gather(*coros)

async def fetch_again(link):
    async with asyncio.Semaphore(10):
        async with aiohttp.ClientSession() as session:
            async with session.get(link) as response:
                text = await response.text()
                sauce = BeautifulSoup(text,"lxml")
                try:
                    name = sauce.select_one("p > b").text
                except Exception:
                    name = ""
                print(name)

if __name__ == '__main__':
    loop = asyncio.ProactorEventLoop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))
In short, what the process_docs() function does is collect the data-id numbers from each page so that they can be reused to build links of the form https://www.tursab.org.tr/en/displayAcenta?AID={}, from which the names are collected out of the pop-up boxes. One such id is 8757, and one such qualified link is therefore https://www.tursab.org.tr/en/displayAcenta?AID=8757.
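As a quick sanity check of that mechanics, the id-to-link step can be verified for a single page with a small synchronous sketch (assuming the requests library is available; the URL pattern and CSS selectors are the ones used in the script above):

import requests
from bs4 import BeautifulSoup

# A synchronous sketch, assuming the requests library; the URL pattern
# and selectors are copied from the asyncio script above.
page_url = "https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa=1"
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

soup = BeautifulSoup(requests.get(page_url).text, "lxml")
ids = [row.get("data-id") for row in soup.select("#acentaTbl tr[data-id]")]
print(ids[:5])                    # the data-id numbers for the first few agencies
print(lead_link.format(ids[0]))   # a qualified link in the AID={} form described above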
By the way, if I change the highest number used in the links variable to around 20 or 30, it runs smoothly.
async def get_links(url):
    async with asyncio.Semaphore(10):

You can't do something like this: it means that a new semaphore instance is created on every function call, whereas you need a single semaphore instance shared by all requests. Change your code this way:

sem = asyncio.Semaphore(10)  # module level

async def get_links(url):
    async with sem:
        # ...

async def fetch_again(link):
    async with sem:
        # ...
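To see why the single shared instance matters, here is a small self-contained sketch (task names and timings are made up purely for illustration): a semaphore created inside each call never limits anything, because every task acquires its own fresh instance, while one instance shared by all tasks does:

import asyncio
import time

async def per_call_task():
    # A fresh Semaphore per call: every task acquires its own instance,
    # so nothing is ever blocked and all 10 tasks sleep concurrently.
    async with asyncio.Semaphore(2):
        await asyncio.sleep(0.1)

async def shared_task(sem):
    # A single Semaphore shared by all tasks: at most 2 hold it at a
    # time, so 10 tasks run in 5 batches.
    async with sem:
        await asyncio.sleep(0.1)

async def main():
    t = time.perf_counter()
    await asyncio.gather(*(per_call_task() for _ in range(10)))
    print("per-call semaphore:", round(time.perf_counter() - t, 2), "s")  # ~0.1 s

    sem = asyncio.Semaphore(2)  # one instance for all requests
    t = time.perf_counter()
    await asyncio.gather(*(shared_task(sem) for _ in range(10)))
    print("shared semaphore:", round(time.perf_counter() - t, 2), "s")    # ~0.5 s

asyncio.run(main())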
Once the semaphore is used properly, you can also return to the default loop:
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(...)
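As a side note, on Python 3.7+ the same thing can be written with asyncio.run(); the sketch below assumes it replaces the if __name__ == '__main__': block of the final script further down, and Python 3.10+ so that the module-level Semaphore binds to the running loop lazily:

# A sketch only: assumes it replaces the __main__ block of the final
# script below, and Python 3.10+ so the module-level Semaphore binds
# to the loop created by asyncio.run() rather than to a default loop.
async def main():
    await asyncio.gather(*(get_links(link) for link in links))

if __name__ == '__main__':
    asyncio.run(main())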
Finally, you should alter get_links(url) and fetch_again(link) to do the parsing outside of the semaphore, so that it is released as soon as possible, before it is needed again inside process_docs(text).
Final code:
import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1,514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

sem = asyncio.Semaphore(10)

async def get_links(url):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
    result = await process_docs(text)
    return result

async def process_docs(html):
    coros = []
    soup = BeautifulSoup(html,"lxml")
    items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
    for item in items:
        coros.append(fetch_again(lead_link.format(item)))
    await asyncio.gather(*coros)

async def fetch_again(link):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(link) as response:
                text = await response.text()
    sauce = BeautifulSoup(text,"lxml")
    try:
        name = sauce.select_one("p > b").text
    except Exception:
        name = "o"
    print(name)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))