Python Requests HTML - img src gets scraped with data:image/gif;base64

Question

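I am trying to scrape product images with requests-html (I can't use BeautifulSoup because the images are loaded dynamically with JavaScript).

I found the image elements on the product page and extracted their src attributes as follows: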

images = r.html.find('img.product-media-gallery__item-image')
for image in images:
    print(image.attrs["src"])

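However, the output contains data:image/gif;base64 strings instead of usable image URLs. I have tried replacing the strings of the tiny images with an empty string, but then nothing is left from the image sources.

How can I remove the pixel-sized images and keep only the useful product image URLs?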

Answer 1

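Those pixel-sized images are placeholders for the actual images. As you said, the data is loaded dynamically with JavaScript, so the real image links never appear in the src attributes of the static HTML. The links are, however, embedded in the page as JSON, and you can get them by parsing the HTML and extracting that JSON.

Start by downloading the HTML of the page: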
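To illustrate why filtering the src values alone is not enough, here is a minimal sketch that drops inline data: URIs from a list of src strings. The helper name drop_placeholders and the sample values are hypothetical, not taken from the page. If every src on the page is a placeholder, as in the question, the result is simply an empty list, which is why the JSON approach described in this answer is needed.

```python
def drop_placeholders(srcs):
    """Keep only real URLs, dropping inline data: URIs such as 1x1 GIF placeholders."""
    return [s for s in srcs if s and not s.strip().lower().startswith("data:")]


# Hypothetical sample values, for illustration only
sample_srcs = [
    "data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==",
    "https://example.com/images/product-1.jpg",
]

real_srcs = drop_placeholders(sample_srcs)
print(real_srcs)
```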

from requests import get

html_data = get("https://www.coolblue.nl/product/858330/sony-kd-65xh9505-2020.html").text

You can use a regular expression to extract the image JSON data from the HTML source, and then unescape the HTML-encoded characters:

import re
from html import unescape

# Raw string so that \s reaches the regex engine instead of being treated as a string escape
decoded_html = unescape(re.search(r'<div class="product-media-gallery js-media-gallery"\s*data-component="(.*)"', html_data).group(1))
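The extraction step can be tried offline on a small snippet. The sketch below is a variation under stated assumptions: sample_html is a hypothetical, simplified stand-in for the real page source, the pattern uses [^"]* so the match stops at the attribute's closing quote, and the None check is an addition so that a layout change produces a clear error rather than an AttributeError.

```python
import re
from html import unescape

# Hypothetical, simplified stand-in for the real page source
sample_html = (
    '<div class="product-media-gallery js-media-gallery"\n'
    '     data-component="[{&quot;name&quot;:&quot;gallery&quot;,'
    '&quot;options&quot;:{&quot;images&quot;:[&quot;https://example.com/a.jpg&quot;]}}]">'
)

# [^"]* stops at the first literal double quote, i.e. the end of the attribute value
pattern = r'<div class="product-media-gallery js-media-gallery"\s*data-component="([^"]*)"'

match = re.search(pattern, sample_html)
if match is None:
    raise ValueError("gallery element not found; the page layout may have changed")

# Turn &quot; and other HTML entities back into plain characters
decoded = unescape(match.group(1))
print(decoded)
```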

You can now parse the JSON into Python objects (the top level is a list) like this:

from json import loads

json_data = loads(decoded_html)

Then simply walk down the JSON until you find the list of image links:

images = json_data[3]["options"]["images"]

print(images)
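The index 3 is tied to the current page layout and may shift if the site adds or removes components. As a hedged alternative, the sketch below scans the parsed list for the first entry whose "options" contain a non-empty "images" list. The helper find_images and the sample data (including the "name" key) are hypothetical; only the options/images nesting is taken from the code above.

```python
def find_images(components):
    """Return the first non-empty options["images"] list found among the components."""
    for component in components:
        if not isinstance(component, dict):
            continue
        options = component.get("options")
        if isinstance(options, dict) and options.get("images"):
            return options["images"]
    return []


# Hypothetical data shaped like the parsed JSON in the answer
sample_json = [
    {"name": "a"},
    {"name": "b", "options": {}},
    {"name": "gallery", "options": {"images": ["https://example.com/a.jpg"]}},
]

found = find_images(sample_json)
print(found)
```

With the real page you would call find_images(json_data) in place of json_data[3]["options"]["images"].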

Putting it all together, the script looks like this:

from requests import get
import re
from html import unescape
from json import loads

# Download the page
html_data = get("https://www.coolblue.nl/product/858330/sony-kd-65xh9505-2020.html").text

# Decode the HTML and get the JSON
decoded_html = unescape(re.search(r'<div class="product-media-gallery js-media-gallery"\s*data-component="(.*)"', html_data).group(1))

# Parse the JSON (the top level is a list)
json_data = loads(decoded_html)

# Get the image list
images = json_data[3]["options"]["images"]

print(images)

python

web-scraping

python-requests-html