bs4 parses different HTML than the browser
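I am trying to scrape farfetch.com (https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282) with Beautifulsoup4, but I cannot find the same components (tags or text in general) in the parsed text (dumped to soup.html) that I see in the browser's dev tools view (when searching for matching strings with CTRL + F).
There is nothing wrong with my code, but here it is regardless: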
#!/usr/bin/python
# imports
import requests
from bs4 import BeautifulSoup as soup

# parse website
url = 'https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282'
response = requests.get(url)
page_html = response.text
page_soup = soup(page_html, "html.parser")

# write parsed soup to file ("w" overwrites, so repeated runs do not pile up in one file)
with open("soup.html", "w", encoding="utf-8") as dumpfile:
    dumpfile.write(str(page_soup))
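When I drag the soup.html file into the browser, everything loads as expected (just like the real URL). I assume this is some kind of protection against parsing? I tried to set a connection header that tells the web server on the other end that I am requesting the page from a real browser, but that did not work either.
- Has anyone encountered something similar before?
- Is there a way to get the REAL HTML as it is shown in the browser?
When I search for the desired content in the browser, it (obviously) shows up...
Here the parsed HTML is saved as "soup.html". The content I am looking for cannot be found, no matter how I search (CTRL+F) or which bs4 function I use (find_all(), find(), and so on).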
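One way to narrow this kind of problem down, sketched here under assumptions, is to count how often a few marker strings occur in the dumped file before reaching for find() or find_all(). The helper names `marker_counts` and `marker_counts_in_file` are hypothetical, and the marker in the usage note is borrowed from the `data-test` attributes used in the answer below; it is not guaranteed to match the live page.

```python
from pathlib import Path


def marker_counts(html_text, markers):
    """Return a dict mapping each marker string to its number of occurrences."""
    return {marker: html_text.count(marker) for marker in markers}


def marker_counts_in_file(path, markers):
    """Read a dumped HTML file (for example soup.html) and count the markers in it."""
    # errors="replace" keeps the check usable even if the dump has odd bytes.
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return marker_counts(text, markers)
```

For example, if `marker_counts_in_file("soup.html", ['data-test="productCard"'])` reports zero, the markup is probably not in the downloaded response at all; a non-zero count points at the search or the selector instead.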
Based on your comments, here is an example of how you could extract some information from the discounted products:
import requests
from bs4 import BeautifulSoup

url = "https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for product in soup.select('[data-test="productCard"]:has([data-test="discountPercentage"])'):
    link = 'https://www.farfetch.com' + product.select_one('a[itemprop="itemListElement"][href]')['href']
    brand = product.select_one('[data-test="productDesignerName"]').get_text(strip=True)
    desc = product.select_one('[data-test="productDescription"]').get_text(strip=True)
    init_price = product.select_one('[data-test="initialPrice"]').get_text(strip=True)
    price = product.select_one('[data-test="price"]').get_text(strip=True)
    images = [i['content'] for i in product.select('meta[itemprop="image"]')]
    print('Link :', link)
    print('Brand :', brand)
    print('Description :', desc)
    print('Initial price :', init_price)
    print('Price :', price)
    print('Images :', images)
    print('-' * 80)
Prints:
Link : https://www.farfetch.com/ch/shopping/men/dashiel-brahmann-printed-button-up-shirt-item-14100332.aspx?storeid=9359
Brand : Dashiel Brahmann
Description : printed button up shirt
Initial price : CHF 438
Price : CHF 219
Images : ['https://cdn-images.farfetch-contents.com/14/10/03/32/14100332_22273147_300.jpg', 'https://cdn-images.farfetch-contents.com/14/10/03/32/14100332_22273157_300.jpg']
--------------------------------------------------------------------------------
Link : https://www.farfetch.com/ch/shopping/men/dashiel-brahmann-corduroy-t-shirt-item-14100309.aspx?storeid=9359
Brand : Dashiel Brahmann
Description : corduroy T-Shirt
Initial price : CHF 259
Price : CHF 156
Images : ['https://cdn-images.farfetch-contents.com/14/10/03/09/14100309_21985600_300.jpg', 'https://cdn-images.farfetch-contents.com/14/10/03/09/14100309_21985606_300.jpg']
--------------------------------------------------------------------------------
... and so on.
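Note that select_one() returns None when nothing matches, so the loop above raises an AttributeError for any card that lacks one of the expected elements. A more defensive variant could look like the sketch below. The helper names `text_or_none` and `discounted_products` are hypothetical, and `SAMPLE_HTML` is invented markup that only mimics the `data-test` attributes from the answer so the example runs without a network request; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Invented sample markup (an assumption, not copied from the real site).
SAMPLE_HTML = """
<ul>
  <li data-test="productCard">
    <h3 data-test="productDesignerName">Example Brand</h3>
    <p data-test="productDescription">example shirt</p>
    <span data-test="initialPrice">CHF 100</span>
    <span data-test="discountPercentage">-50%</span>
    <span data-test="price">CHF 50</span>
  </li>
  <li data-test="productCard">
    <h3 data-test="productDesignerName">Second Brand</h3>
    <span data-test="discountPercentage">-20%</span>
    <span data-test="price">CHF 80</span>
  </li>
  <li data-test="productCard">
    <h3 data-test="productDesignerName">Full Price Brand</h3>
    <span data-test="price">CHF 70</span>
  </li>
</ul>
"""


def text_or_none(node, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    found = node.select_one(selector)
    return found.get_text(strip=True) if found is not None else None


def discounted_products(html):
    """Collect basic fields from every product card that shows a discount."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.select('[data-test="productCard"]:has([data-test="discountPercentage"])')
    return [
        {
            "brand": text_or_none(card, '[data-test="productDesignerName"]'),
            "description": text_or_none(card, '[data-test="productDescription"]'),
            "initial_price": text_or_none(card, '[data-test="initialPrice"]'),
            "price": text_or_none(card, '[data-test="price"]'),
        }
        for card in cards
    ]


for product in discounted_products(SAMPLE_HTML):
    print(product)
```

Missing fields come back as None instead of stopping the whole run, which makes it easier to see which cards deviate from the expected layout.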
The following helped me:
Instead of the code below
page_soup = soup(page_html, "html.parser")
use
page_soup = soup(page_html, "html")
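The parser choice can change the tree that Beautiful Soup builds, especially for invalid markup. As a rough sketch (the helper name `parse_with_each` and the sample snippet are illustrative assumptions, not part of this answer), the following parses the same string with each parser that happens to be installed, so the results can be compared side by side:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse the markup once per parser; None marks a parser that is not installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            results[name] = None
    return results


# "<a></p>" is deliberately invalid: the closing tag has no matching opener.
for parser_name, rendered in parse_with_each("<a></p>").items():
    print(parser_name, "->", rendered)
```

If the outputs differ for the page in question, that would be one possible explanation for why switching the parser argument changed what could be found.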