如何使用 Beautiful Soup 提取这本词典

Question

我想获取字典的可变最后一个元素（粘贴在下面），它在另一个字典“offers”中，但我不知道如何提取它。

    html = s.get(url=url, headers=headers, verify=False, timeout=15)
    soup = BeautifulSoup(html.text, 'html.parser')
    products = soup.find_all('script', {'type': "application/ld+json"})

{"@context":"http://schema.org","@type":"Product","aggregateRating":{"@type":"AggregateRating","bestRating":5,"ratingValue":"4.8","ratingCount":11,"worstRating":3,"reviewCount":5},"brand":{"@type":"Brand","name":"New Balance"},"color":"white/red/biały","image":["https://img01.ztat.net/3"],"itemCondition":"http://schema.org/NewCondition","manufacturer":"New Balance","name":"550 UNISEX - Sneakersy niskie - white/red","offers":[{"@type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110001000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"},{"@type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110002000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"} (...)

Answer 1

如前所述，通过 BeautifulSoup 提取内容将 string 解码为 json.loads():

import json

products = '{"@context":"http://schema.org","@type":"Product","aggregateRating":{"@type":"AggregateRating","bestRating":5,"ratingValue":"4.8","ratingCount":11,"worstRating":3,"reviewCount":5},"brand":{"@type":"Brand","name":"New Balance"},"color":"white/red/biały","image":["https://img01.ztat.net/3"],"itemCondition":"http://schema.org/NewCondition","manufacturer":"New Balance","name":"550 UNISEX - Sneakersy niskie - white/red","offers":[{"@type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110001000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"},{"@type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110002000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"}]}'

products = json.loads(products)

获取报价中的最后一个元素（dict）：

products['offers'][-1]

输出：

{'@type': 'Offer',
 'availability': 'http://schema.org/OutOfStock',
 'price': '489',
 'priceCurrency': 'PLN',
 'sku': 'NE215O06U-A110002000',
 'url': '/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html'}

例子

在您的特殊情况下，您还必须先 replace('"','"')：

from bs4 import BeautifulSoup
import requests, json
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae", 
    "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
html = requests.get('https://www.zalando.de/new-balance-550-unisex-sneaker-low-whitered-ne215o06u-a11.html', headers=headers)
soup = BeautifulSoup(html.content, 'lxml')

jsonData = json.loads(soup.select_one('script[type="application/ld+json"]').text.replace('&quot;','"'))

jsonData['offers'][-1]

如何使用 Beautiful Soup 提取这本词典

How to extract this dictionary using Beautiful Soup

python

web-scraping

web

例子