Python Web Scraping JavaScript with Await
I'm having a problem web scraping with Python. I'm trying to use from requests_html import AsyncHTMLSession to get the data from the first table at https://www.nyse.com/ipo-center/filings.
My code is here:
from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession
#first define the URL and start the session
url = 'http://www.nyse.com/ipo-center/filings'
session = AsyncHTMLSession()
#then get the URL content, and load the html content after parsing through the javascript
r = await session.get(url)
await r.html.arender()
#then we create a beautifulsoup object based on the rendered html
soup = BeautifulSoup(r.html.html, "lxml")
#then we find the first datatable, which is the one that contains upcoming IPO data
table1 = soup.find('table', class_='table table-data table-condensed spacer-lg')
Now I have two problems:
- The site often doesn't return any valid information for table1, so I can't get at the underlying data in the table. So far I've worked around this by simply waiting a few seconds and then running the loop again until the data frame is loaded, which is probably not the best option.
- The code does work in a Jupyter Notebook, but as soon as I upload it to my server as a .py file, I get the error SyntaxError: 'await' outside async function.
Does anyone have a solution to the two problems above?
Since you're using coroutines, you need to wrap them in an async function (Jupyter's IPython kernel runs its own event loop and allows top-level await, but a plain .py script does not). See the example below:
from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession
#first define the URL and start the session
url = 'http://www.nyse.com/ipo-center/filings'
session = AsyncHTMLSession()
#then get the URL content, and load the html content after parsing through the javascript
async def get_page():
    r = await session.get(url)
    # render the page's JavaScript; timeout raised to 20s for slow loads
    await r.html.arender(timeout=20)
    # return the rendered HTML, not r.text (which would be the raw, un-rendered response body)
    return r.html.html
data = session.run(get_page)
#then we create a beautifulsoup object based on the rendered html
soup = BeautifulSoup(data[0], "lxml")
#then we find all tables with this class; the first one contains upcoming IPO data
table1 = soup.find_all('table', class_='table table-data table-condensed spacer-lg')
print(table1)
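As for the first problem (the table often coming back empty), a fixed sleep is fragile because load times vary. One alternative is to poll: re-render the page and check for the table until it shows up or you run out of attempts. Below is a minimal sketch of that idea; the attempt count and sleep value are arbitrary assumptions you'd want to tune, and get_table_html is just an illustrative name.

from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession

url = 'http://www.nyse.com/ipo-center/filings'
session = AsyncHTMLSession()

async def get_table_html(max_attempts=5):  # max_attempts is an arbitrary assumption
    r = await session.get(url)
    for attempt in range(max_attempts):
        # sleep= pauses after the render so the page's JavaScript has time to fill the table
        await r.html.arender(timeout=20, sleep=2)
        soup = BeautifulSoup(r.html.html, "lxml")
        table = soup.find('table', class_='table table-data table-condensed spacer-lg')
        # accept the table only once it is present and actually has rows
        if table is not None and table.find('tr'):
            return table
    return None  # the table never appeared; let the caller decide what to do

table1 = session.run(get_table_html)[0]
print(table1)

And if the end goal is a pandas DataFrame, pandas.read_html(str(table1)) should parse the table element directly once it is populated (newer pandas versions may ask you to wrap the string in io.StringIO first).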