bs4 parses different HTML than the browser

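I am trying to scrape farfetch.com (https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282) with Beautifulsoup4, but I cannot find the same components (tags, or text in general) in the parsed text (dumped to soup.html) that I see in the browser's dev tools view (when searching for matching strings with CTRL + F).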

I don't think there is anything wrong with my code, but here it is anyway:

#!/usr/bin/python 
# imports
import bs4
import requests
from bs4 import BeautifulSoup as soup

# parse website
url = 'https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282'
response = requests.get(url)
page_html = response.text
page_soup = soup(page_html, "html.parser")

# write parsed soup to file
with open("soup.html", "w", encoding="utf-8") as dumpfile:  # "w" so repeated runs do not append duplicate dumps
    dumpfile.write(str(page_soup))

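When I drag the soup.html file into the browser, everything loads as expected (just like the real URL). I assume this is some kind of protection against parsing? I tried adding a request header that tells the web server on the other end that I am requesting the page from a real browser, but that did not work either.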

  1. Has anyone come across something similar?
  2. Is there a way to get the REAL html as shown in the browser?
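For reference, a header attempt of the kind described above might look like the sketch below. The User-Agent and Accept-Language values are assumptions picked for illustration, not the ones actually used, and sending them does not guarantee the server returns the same markup a browser ends up showing.

```python
import requests

url = "https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282"

# Assumption: example browser-like header values, not taken from the original post.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

# Build the request without sending it, so the headers can be inspected offline.
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])

# To actually send it:
# response = requests.Session().send(prepared, timeout=10)
# page_html = response.text
```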

When I search for the desired content in the browser, it (obviously) shows up...

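Here is the parsed html, saved as "soup.html". The content I am looking for cannot be found, no matter whether I search with CTRL+F or with bs4 functions such as find_all() or find().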
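One way to narrow this down is to check where a string lives in the downloaded HTML: in ordinary markup, only inside a script block, or nowhere at all. The sketch below is a hypothetical helper (the name `locate` and the sample markup are made up) and only illustrates the idea. A string that appears only inside a script tag is typically inserted into the page by JavaScript in the browser, which would explain why find() and find_all() do not see it as a tag. That is a common cause, not something confirmed for this site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the brand name exists only inside a <script> block.
SAMPLE_HTML = (
    '<html><body><div id="app"></div>'
    '<script>window.__DATA__ = {"brand": "Brand A"};</script>'
    "<p>Sale</p></body></html>"
)

def locate(html, needle):
    """Report whether needle occurs in the raw HTML, in normal markup text, or in scripts."""
    soup = BeautifulSoup(html, "html.parser")
    in_markup = in_script = False
    for text in soup.find_all(string=True):
        if needle in text:
            if text.parent.name in ("script", "style"):
                in_script = True
            else:
                in_markup = True
    return {"in_raw": needle in html, "in_markup_text": in_markup, "in_script": in_script}

print(locate(SAMPLE_HTML, "Brand A"))
print(locate(SAMPLE_HTML, "Sale"))
```

With the real page, `response.text` would be passed in place of `SAMPLE_HTML`.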

Based on your comments, here is an example of how you can extract some information from the discounted products:

import requests
from bs4 import BeautifulSoup

url = "https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282"

soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for product in soup.select('[data-test="productCard"]:has([data-test="discountPercentage"])'):

    link = 'https://www.farfetch.com' + product.select_one('a[itemprop="itemListElement"][href]')['href']
    brand = product.select_one('[data-test="productDesignerName"]').get_text(strip=True)
    desc = product.select_one('[data-test="productDescription"]').get_text(strip=True)
    init_price = product.select_one('[data-test="initialPrice"]').get_text(strip=True)
    price = product.select_one('[data-test="price"]').get_text(strip=True)
    images = [i['content'] for i in product.select('meta[itemprop="image"]')]

    print('Link          :', link)
    print('Brand         :', brand)
    print('Description   :', desc)
    print('Initial price :', init_price)
    print('Price         :', price)
    print('Images        :', images)
    print('-' * 80)

Prints:

Link          : https://www.farfetch.com/ch/shopping/men/dashiel-brahmann-printed-button-up-shirt-item-14100332.aspx?storeid=9359
Brand         : Dashiel Brahmann
Description   : printed button up shirt
Initial price : CHF 438
Price         : CHF 219
Images        : ['https://cdn-images.farfetch-contents.com/14/10/03/32/14100332_22273147_300.jpg', 'https://cdn-images.farfetch-contents.com/14/10/03/32/14100332_22273157_300.jpg']
--------------------------------------------------------------------------------
Link          : https://www.farfetch.com/ch/shopping/men/dashiel-brahmann-corduroy-t-shirt-item-14100309.aspx?storeid=9359
Brand         : Dashiel Brahmann
Description   : corduroy T-Shirt
Initial price : CHF 259
Price         : CHF 156
Images        : ['https://cdn-images.farfetch-contents.com/14/10/03/09/14100309_21985600_300.jpg', 'https://cdn-images.farfetch-contents.com/14/10/03/09/14100309_21985606_300.jpg']
--------------------------------------------------------------------------------

... and so on.
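The loop above indexes the result of select_one() directly, so it raises an error as soon as one card lacks one of the elements. A slightly more defensive variant could look like the sketch below. The helper names (`text_or_none`, `discounted_products`) and the sample markup are made up for illustration and only imitate the data-test attributes used above. The real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written markup that imitates the attributes used above.
SAMPLE_HTML = """
<ul>
  <li data-test="productCard">
    <h3 data-test="productDesignerName">Brand A</h3>
    <span data-test="initialPrice">CHF 100</span>
    <span data-test="price">CHF 50</span>
    <span data-test="discountPercentage">-50%</span>
  </li>
  <li data-test="productCard">
    <h3 data-test="productDesignerName">Brand B</h3>
    <span data-test="price">CHF 80</span>
  </li>
</ul>
"""

def text_or_none(node, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    match = node.select_one(selector)
    return match.get_text(strip=True) if match is not None else None

def discounted_products(html):
    """Collect basic fields from every product card that contains a discount badge."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.select('[data-test="productCard"]:has([data-test="discountPercentage"])')
    return [
        {
            "brand": text_or_none(card, '[data-test="productDesignerName"]'),
            "description": text_or_none(card, '[data-test="productDescription"]'),
            "initial_price": text_or_none(card, '[data-test="initialPrice"]'),
            "price": text_or_none(card, '[data-test="price"]'),
        }
        for card in cards
    ]

products = discounted_products(SAMPLE_HTML)
for product in products:
    print(product)
```

Only the first card is returned, because the second one has no discountPercentage element. Its missing description comes back as None instead of raising.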

The following helped me. Instead of the code below

page_soup = soup(page_html, "html.parser")

use

page_soup = soup(page_html, "html")
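As far as I understand, "html" is a feature name rather than a specific parser: Beautiful Soup then picks the best HTML parser that happens to be installed (lxml, if available), so the result can differ from machine to machine. To see whether the parser choice matters for a given document, the same markup can be run through each parser explicitly. The sketch below assumes nothing beyond bs4 itself. Parsers that are not installed are skipped, and the sample markup and the function name `parse_with_each` are made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical malformed snippet: the <p> tags are never closed.
BROKEN_HTML = "<p>first<p>second"

def parse_with_each(html, parsers=("html.parser", "lxml", "html5lib")):
    """Parse html with every listed parser; map parser name to output, or None if unavailable."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(html, name))
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            results[name] = None
    return results

results = parse_with_each(BROKEN_HTML)
for name, markup in results.items():
    print(name, "->", markup if markup is not None else "not installed")
```

If the outputs differ for the real page, naming the parser explicitly (for example "lxml") keeps the behaviour reproducible.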