Web scraping - how to access content rendered in JavaScript via Angular.js?

Question

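I am trying to scrape data from the public site asx.com.au.

The page http://www.asx.com.au/asx/research/company.do#!/ACB/details contains a div with the class 'view-content', which holds the information I need.

But when I try to fetch this page with Python's urllib2.urlopen, the div is empty: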

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.asx.com.au/asx/research/company.do#!/ACB/details'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")
contentDiv = soup.find("div", {"class": "view-content"})
print(contentDiv)

# the result is an empty div:
# <div class="view-content" ui-view=""></div>

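Is it possible to access the content of the div programmatically?

Edit: according to the comments, the content appears to be rendered by Angular.js. Is it possible to trigger that rendering from Python?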

Answer 1

This page uses JavaScript to read data from the server and fill in the page.

I see you are using Developer Tools in Chrome. Look at the XHR or JS requests in the Network tab.

I found this URL:

http://data.asx.com.au/data/1/company/ACB?fields=primary_share,latest_annual_reports,last_dividend,primary_share.indices&callback=angular.callbacks._0

This URL returns nearly all of the data in JSON format.

If you use this link without &callback=angular.callbacks._0, you get the data as pure JSON, which you can convert to a Python dictionary with the json module.
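If the callback parameter is left in the URL, the response is typically JSONP, that is, the JSON wrapped in a function call, and json.loads will reject it. The following is only a minimal sketch of how such a wrapper could be stripped. The helper name parse_jsonp and the sample payload are made up for illustration and are not real ASX data.

```python
import json
import re

# Matches "<callback name>( ... )" with an optional trailing semicolon.
_JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def parse_jsonp(text):
    """Parse a JSONP response; fall back to plain JSON if there is no wrapper.

    Hypothetical helper, not part of any library.
    """
    match = _JSONP_RE.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Made-up payload for illustration only (not real ASX data).
sample = 'angular.callbacks._0({"code": "ACB", "indices": []});'
print(parse_jsonp(sample)["code"])
```

Dropping the callback parameter from the URL, as described above, is simpler whenever the server allows it. Unwrapping is only needed when the endpoint insists on JSONP.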

Edit: working code

import urllib2
import json

# new url (callback parameter removed)
url = 'http://data.asx.com.au/data/1/company/ACB?fields=primary_share,latest_annual_reports,last_dividend,primary_share.indices'

# read all data
page = urllib2.urlopen(url).read()

# convert json text to python dictionary
data = json.loads(page)

print(data['principal_activities'])

Output:

Mineral exploration in Botswana, China and Australia.

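Edit (2020.12.23)

This answer is almost 5 years old and was written for Python 2. Python 3 needs urllib.request.urlopen() or requests.get() instead, but the real problem is that over those 5 years the page has changed its structure and technology. The URLs in the question and in the answer no longer exist. The page needs a fresh analysis and a new approach.

The URL in the question was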

http://www.asx.com.au/asx/research/company.do#!/ACB/details

but the current page uses the URL

https://www2.asx.com.au/markets/company/acb

and its AJAX/XHR requests use different URLs:

https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/about
https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/announcements
https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/key-statistics
and so on.

You can find more URLs using DevTools in Chrome/Firefox (tab: Network, filter: XHR).

import urllib.request
import json

# new url
url = 'https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/about'

# read all data
page = urllib.request.urlopen(url).read()

# convert json text to python dictionary
data = json.loads(page)

print(data['data']['description'])

Output:

Minerals exploration & development
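The endpoints listed above share one pattern, so the same request can be reused for other sections. Below is a rough sketch under that assumption. The helper names build_url and fetch_section are hypothetical, the section names are just the three shown above, and nothing here is a documented API; the endpoints may change or disappear again.

```python
import json
import urllib.request

# Base path taken from the URLs seen in DevTools; assumed, not documented.
BASE_URL = 'https://asx.api.markitdigital.com/asx-research/1.0/companies'


def build_url(ticker, section):
    """Build the URL for one section of one company (hypothetical helper)."""
    return '{}/{}/{}'.format(BASE_URL, ticker.lower(), section)


def fetch_section(ticker, section, timeout=10):
    """Download one section and decode the JSON text into Python objects."""
    with urllib.request.urlopen(build_url(ticker, section), timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


# Print the URLs that would be requested; no network access happens here.
for section in ('about', 'announcements', 'key-statistics'):
    print(build_url('ACB', section))
```

Assuming the endpoint still responds as it does above, fetch_section('ACB', 'about')['data']['description'] reads the same field as the previous snippet. The layout of the other sections is not shown here, so inspect the returned dictionary before relying on any key.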
