Python webscraping Javascript with Await

I'm having trouble web scraping with Python. I'm trying to use from requests_html import AsyncHTMLSession to get data from the first table at https://www.nyse.com/ipo-center/filings.

My code is below:

from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession

#first define the URL and start the session
url = 'http://www.nyse.com/ipo-center/filings'
session = AsyncHTMLSession()

#then get the URL content, and load the html content after parsing through the javascript
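#note: top-level await like this only works in environments such as Jupyter/IPython,
#which run their own event loop; in a plain .py script it raises a SyntaxError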
r = await session.get(url)
await r.html.arender()

#then we create a beautifulsoup object based on the rendered html
soup = BeautifulSoup(r.html.html, "lxml")

#then we find the first datatable, which is the one that contains upcoming IPO data
table1 = soup.find('table', class_='table table-data table-condensed spacer-lg')

Now I have two issues:

  1. The site usually doesn't return any valid information for table1, so I can't get at the underlying data in the table. So far I've worked around this by simply waiting a few seconds and then running the loop again until the data frame is loaded, which is probably not the best approach.
  2. The code does work in a Jupyter Notebook, but as soon as I upload it to my server as a .py file, I get the error SyntaxError: 'await' outside async function

Does anyone have a solution to the two issues mentioned above?

Since you're using coroutines, you need to wrap them in an async function. See the example below:

from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession

#first define the URL and start the session
url = 'http://www.nyse.com/ipo-center/filings'
session = AsyncHTMLSession()

#then get the URL content and render the javascript inside an async function
async def get_page():
    r = await session.get(url)
    await r.html.arender(timeout=20)
    return r.html.html  #return the rendered html, not the raw (un-rendered) response text

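#session.run executes the coroutine function on the session's event loop
#and returns the results as a list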
data = session.run(get_page)

#then we create a beautifulsoup object based on the rendered html
soup = BeautifulSoup(data[0], "lxml")

#then we find all datatables with this class; the first one holds the upcoming IPO data
table1 = soup.find_all('table', class_='table table-data table-condensed spacer-lg')
print(table1)
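As for your first issue: the table is filled in by javascript, so the render sometimes finishes before the data has loaded. Rather than sleeping a fixed few seconds, you can re-render until the table actually shows up. Below is a minimal sketch of that idea; get_table and max_attempts are just illustrative names, and it leans on arender's sleep parameter (extra seconds to wait after the page load before the html is captured) alongside its timeout:

from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession

url = 'http://www.nyse.com/ipo-center/filings'
session = AsyncHTMLSession()

#hypothetical retry wrapper: render the page repeatedly until the table is present
async def get_table(max_attempts=5):
    r = await session.get(url)
    for attempt in range(max_attempts):
        #sleep=2 gives the javascript a couple of extra seconds before the html is captured
        await r.html.arender(timeout=20, sleep=2)
        soup = BeautifulSoup(r.html.html, "lxml")
        table = soup.find('table', class_='table table-data table-condensed spacer-lg')
        if table is not None:
            return table
    return None  #still not loaded after max_attempts renders

table1 = session.run(get_table)[0]
print(table1)

Checking for the table explicitly means the loop only stops once the data you're after is actually in the rendered html, instead of hoping a fixed wait was long enough.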