Scrape 'dictionary' type object from top of HTML file (bunch of text, not in a class)

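Consider the source of this page: view-source:http://www.steepandcheap.com/gear-cache/shop-smartwool-on-sale/SWL00II-GRA

At the top there is a dictionary/JSON-like block of text that starts with "window.BC.product = ".

Assuming I have the soup object for this page, how can I extract that text and convert it to a Python dictionary so that I can pull specific data out of it?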

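Locate the script tag by checking whether its text contains "window.BC.product".

Once you have the script's contents, use a regular expression to extract the JavaScript object you need, then pass it to json.loads() to get a Python dictionary: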

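import json
import re

import requests
from bs4 import BeautifulSoup

# Capture everything between "window.BC.product = " and the last ";" on that line
pattern = re.compile(r"window\.BC\.product = (.*);", re.MULTILINE)

response = requests.get("http://www.steepandcheap.com/gear-cache/shop-smartwool-on-sale/SWL00II-GRA")
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(response.content, "html.parser")

# Find the <script> whose text contains the assignment
# (newer bs4 releases call this argument string= instead of text=)
script = soup.find("script", text=lambda x: x and "window.BC.product" in x).text
data = json.loads(pattern.search(script).group(1))
print(data)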

Prints:

{u'features': [{u'name': u'Material', u'description': u'[shell] 86% polyester, ... u'Zippered back pocket\r', u'Reflective details']}
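
One caveat with the greedy (.*); pattern: if another JavaScript statement follows the object on the same line, the capture runs up to the last semicolon and json.loads() fails. A possible alternative, sketched here under assumptions, is to let the json module read exactly one value starting right after the "=" sign. The script string and the helper name extract_assigned_json below are made up for illustration; the real page's object is much larger and may be laid out differently.

```python
import json

# Hypothetical stand-in for the text of the <script> tag (not the real page data)
script = '''
window.BC = window.BC || {};
window.BC.product = {"id": "SWL00II", "features": [{"name": "Material"}]}; window.BC.ready = true;
'''


def extract_assigned_json(script_text, marker="window.BC.product"):
    """Return the JSON value assigned to `marker` inside script_text."""
    # Jump to the "=" that follows the marker, then skip any whitespace
    start = script_text.index(marker) + len(marker)
    start = script_text.index("=", start) + 1
    while script_text[start].isspace():
        start += 1
    # raw_decode parses a single JSON value at `start` and ignores what follows
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj


data = extract_assigned_json(script)
print(data["features"][0]["name"])  # prints: Material
```

This only works when the assigned value is strict JSON (double-quoted keys, no trailing commas, no JavaScript expressions), which is the same requirement json.loads() already imposes in the answer above.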