Scrape 'dictionary' type object from top of HTML file (bunch of text, not in a class)

Question

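Consider the source code of this page: http://www.steepandcheap.com/gear-cache/shop-smartwool-on-sale/SWL00II-GRA

Near the top there is a dictionary/JSON-like block of text that starts with "window.BC.product = ".

Assume I already have a soup object for this page. How can I extract that text and convert it into a Python dictionary so that I can pull specific data out of it?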

Answer 1

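Find the script element by checking for text that contains "window.BC.product".

Once you have the script contents, use a regular expression to extract the JavaScript object, then load it with json.loads() to get a Python dictionary: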

import json
import re

import requests
from bs4 import BeautifulSoup

# Capture everything between "window.BC.product = " and the last ";" on that line.
pattern = re.compile(r"window\.BC\.product = (.*);", re.MULTILINE)

response = requests.get("http://www.steepandcheap.com/gear-cache/shop-smartwool-on-sale/SWL00II-GRA")
soup = BeautifulSoup(response.content, "html.parser")

# Find the <script> element whose text contains the assignment.
script = soup.find("script", text=lambda x: x and "window.BC.product" in x).text
data = json.loads(pattern.search(script).group(1))
print(data)

Prints (Python 2 output, hence the u'' prefixes):

{u'features': [{u'name': u'Material', u'description': u'[shell] 86% polyester, ... u'Zippered back pocket\r', u'Reflective details']}
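
The regular expression above depends on the object sitting on a single line and on the captured text being valid JSON. As a rough sketch of a variation that does not depend on where the line ends, the standard library's json.JSONDecoder.raw_decode can parse a single value starting at a given offset and ignore whatever follows it. The script_text sample and the extract_js_object helper below are made up for illustration; in practice the text would come from the script element found with BeautifulSoup, and the approach still assumes the object is valid JSON.

```python
import json

# Hypothetical stand-in for the text of the <script> element; the real page
# content is not reproduced here.
script_text = """
window.BC = window.BC || {};
window.BC.product = {"name": "Sample; jersey", "features": [{"name": "Material", "description": "polyester"}]};
window.BC.other = 1;
"""

MARKER = "window.BC.product = "


def extract_js_object(text, marker=MARKER):
    """Decode the JSON value that immediately follows `marker` in `text`."""
    # str.index raises ValueError if the marker is missing.
    start = text.index(marker) + len(marker)
    # raw_decode parses one JSON value beginning at `start` and stops there,
    # so a ";" inside a string or later statements do not interfere.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


product = extract_js_object(script_text)

# Once decoded, the result is an ordinary dictionary.
print(product["name"])
print([feature["name"] for feature in product["features"]])
```

With the real page, specific fields could be read the same way, for example data['features'] from the dictionary shown in the output above.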

python

beautifulsoup

scrapy

web-scraping

python-2.7