bs4 parses different HTML than the browser

Question

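I am trying to scrape farfetch.com (https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282) using BeautifulSoup4, but the parsed text (dumped to soup.html) does not contain the same components (usually tags or text) that I see in the browser's dev tools view when I search for matching strings with CTRL + F.

I don't think anything is wrong with my code, but here it is regardless: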

#!/usr/bin/python
# imports (the bare "import bs4" was unused and has been dropped)
import requests
from bs4 import BeautifulSoup as soup

# parse website
url = 'https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282'
response = requests.get(url)
page_html = response.text
page_soup = soup(page_html, "html.parser")

# write parsed soup to file ("w" overwrites earlier dumps instead of appending to them)
with open("soup.html", "w", encoding="utf-8") as dumpfile:
    dumpfile.write(str(page_soup))

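When I drag the soup.html file into the browser, everything loads as expected (just like the real URL). I assume this is some kind of protection against parsing? I tried setting a request header that tells the web server on the other end that I am requesting the page from a real browser, but that did not work either.

Has anyone run into something similar?
Is there a way to get the REAL HTML as shown in the browser?

When I search for the desired content in the browser, it (obviously) shows up...

Here is the parsed HTML saved as "soup.html". The content I am looking for cannot be found, whether I search with CTRL+F or use bs4 functions such as find_all() or find().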

Answer 1

Based on your comments, here is an example of how you can extract some information from the discounted products:

import requests
from bs4 import BeautifulSoup

url = "https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282"

soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for product in soup.select('[data-test="productCard"]:has([data-test="discountPercentage"])'):

    link = 'https://www.farfetch.com' + product.select_one('a[itemprop="itemListElement"][href]')['href']
    brand = product.select_one('[data-test="productDesignerName"]').get_text(strip=True)
    desc = product.select_one('[data-test="productDescription"]').get_text(strip=True)
    init_price = product.select_one('[data-test="initialPrice"]').get_text(strip=True)
    price = product.select_one('[data-test="price"]').get_text(strip=True)
    images = [i['content'] for i in product.select('meta[itemprop="image"]')]

    print('Link          :', link)
    print('Brand         :', brand)
    print('Description   :', desc)
    print('Initial price :', init_price)
    print('Price         :', price)
    print('Images        :', images)
    print('-' * 80)

Prints:

Link          : https://www.farfetch.com/ch/shopping/men/dashiel-brahmann-printed-button-up-shirt-item-14100332.aspx?storeid=9359
Brand         : Dashiel Brahmann
Description   : printed button up shirt
Initial price : CHF 438
Price         : CHF 219
Images        : ['https://cdn-images.farfetch-contents.com/14/10/03/32/14100332_22273147_300.jpg', 'https://cdn-images.farfetch-contents.com/14/10/03/32/14100332_22273157_300.jpg']
--------------------------------------------------------------------------------
Link          : https://www.farfetch.com/ch/shopping/men/dashiel-brahmann-corduroy-t-shirt-item-14100309.aspx?storeid=9359
Brand         : Dashiel Brahmann
Description   : corduroy T-Shirt
Initial price : CHF 259
Price         : CHF 156
Images        : ['https://cdn-images.farfetch-contents.com/14/10/03/09/14100309_21985600_300.jpg', 'https://cdn-images.farfetch-contents.com/14/10/03/09/14100309_21985606_300.jpg']
--------------------------------------------------------------------------------

... and so on.
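
The selectors above raise an error as soon as one card lacks a field, because select_one() returns None when nothing matches. Below is a minimal offline sketch of a more defensive variant. It assumes the same data-test attributes as the answer; SAMPLE_HTML is made-up markup, and text_or_none and extract_discounted are hypothetical helper names, not part of bs4 or of the site.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mimics the data-test attributes used above.
SAMPLE_HTML = """
<ul>
  <li data-test="productCard">
    <h3 data-test="productDesignerName">Brand A</h3>
    <p data-test="productDescription">printed shirt</p>
    <span data-test="initialPrice">CHF 438</span>
    <span data-test="discountPercentage">-50%</span>
    <span data-test="price">CHF 219</span>
  </li>
  <li data-test="productCard">
    <h3 data-test="productDesignerName">Brand B</h3>
    <span data-test="price">CHF 100</span>
  </li>
</ul>
"""


def text_or_none(card, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = card.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def extract_discounted(html):
    """Collect one dict per product card that contains a discount badge."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.select(
        '[data-test="productCard"]:has([data-test="discountPercentage"])'
    )
    return [
        {
            "brand": text_or_none(card, '[data-test="productDesignerName"]'),
            "desc": text_or_none(card, '[data-test="productDescription"]'),
            "init_price": text_or_none(card, '[data-test="initialPrice"]'),
            "price": text_or_none(card, '[data-test="price"]'),
        }
        for card in cards
    ]


if __name__ == "__main__":
    for item in extract_discounted(SAMPLE_HTML):
        print(item)
```

Only the first sample card carries a discount badge, so the second one is skipped by the :has() filter. A missing field ends up as None instead of aborting the loop.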

Answer 2

The following helped me.
Instead of the code below

page_soup = soup(page_html, "html.parser")

use

page_soup = soup(page_html, "html")
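
A likely reason this makes a difference: as far as I understand, "html" is a feature name rather than a specific parser, so bs4 picks the best HTML parser it finds installed (lxml, then html5lib, then html.parser), and those parsers can build different trees from the same markup. The sketch below is only an illustration of that effect on a small invalid snippet. BROKEN and parse_with are hypothetical names, and lxml and html5lib are optional extras that may not be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A tiny piece of invalid markup: a closing </p> without an opening <p>.
BROKEN = "<a></p>"


def parse_with(parser):
    """Return the parsed tree as a string, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(BROKEN, parser))
    except FeatureNotFound:
        # bs4 raises this when no installed parser provides the requested feature.
        return None


if __name__ == "__main__":
    for name in ("html.parser", "lxml", "html5lib"):
        print(f"{name:12} ->", parse_with(name))
```

Naming the parser explicitly ("html.parser", "lxml" or "html5lib") keeps the result the same across machines, whereas "html" depends on what happens to be installed.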
