Scrape an image using Soup

Question

I am trying to scrape images from this website: https://www.remax.ca/on/richmond-hill-real-estate/-2407--9201-yonge-st-wp_id268950754-lst. My current code is:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.remax.ca/on/richmond-hill-real-estate/-2407--9201-yonge-st-wp_id268950754-lst'
soup = BeautifulSoup(urlopen(url), 'html.parser')
imgs = soup.find_all('div', attrs={'class': 'images is-flex flex-one has-flex-align-center has-flex-content-center'})

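When I inspect the contents of imgs, I cannot find the image active ng-star-inserted ng-lazyloaded element or its srcset attribute. As a result, I cannot download the images.

Can anyone suggest how to solve this?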

Answer 1

You can locate the images with XPath, fetch each one with requests, and write it to a file, as follows:

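import requests
from lxml import html

# send request to website ("thewebsite" is a placeholder for the page URL)
r = requests.get("thewebsite")

# convert to html object
tree = html.fromstring(r.content)

# find image urls from xpath ("xpaths" is a placeholder; image URLs normally sit in src, not href)
image_urls = tree.xpath("xpaths/@src")

# download each image and write its bytes to its own file
# (the .jpg extension is an assumption; relative URLs would need resolving first)
for n, image_url in enumerate(image_urls):
    image = requests.get(image_url)
    with open(f"image_{n}.jpg", "wb") as f:
        f.write(image.content)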
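
A slightly fuller sketch of the download step is below. It is only an illustration: it swaps requests for the standard-library urlopen that the question already uses, and the helper names filename_from_url and download_all, as well as the "images" directory, are made up for this example. The idea is to give each image its own filename so that later downloads do not overwrite earlier ones.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_from_url(url, index):
    """Build a unique local filename from an image URL (hypothetical helper)."""
    # Take the last path segment, ignoring any query string.
    name = os.path.basename(urlparse(url).path)
    # Fall back to a generic name when the URL path has no usable segment.
    if not name:
        name = "image.jpg"
    # Prefix the index so two URLs with the same basename do not collide.
    return f"{index:03d}_{name}"


def download_all(image_urls, dest_dir="images"):
    """Fetch every URL and write the raw bytes into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(image_urls):
        path = os.path.join(dest_dir, filename_from_url(url, index))
        # response.read() returns the image bytes, which is what must be written.
        with urlopen(url) as response, open(path, "wb") as f:
            f.write(response.read())
        saved.append(path)
    return saved
```

Calling download_all(image_urls) with a list of absolute image URLs should leave one file per image in the images directory and return the paths it wrote.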

Answer 2

The images are lazy-loaded, which I think is the problem. So I scraped the script that loads and manages these images instead.

import json

# soup is the BeautifulSoup object built in the question
script = soup.find('script', {'type': 'application/ld+json'})
script_json = json.loads(script.contents[0])
imgs = script_json['@graph'][1]['photo']['url']

imgs now contains a list of all 11 images for the residence at the link you provided.
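
The hard-coded ['@graph'][1] index depends on the order of entries in the page's JSON-LD, which may change. As a rough sketch, the same lookup can be done by walking the parsed JSON and collecting the url values under any photo key. The function names and the sample payload below are invented for illustration; the real page's structure is only assumed to resemble it.

```python
import json


def _urls_from_photo(photo):
    """Flatten the shapes a 'photo' value may take into a list of URLs."""
    if isinstance(photo, str):
        return [photo]
    if isinstance(photo, list):
        out = []
        for item in photo:
            out.extend(_urls_from_photo(item))
        return out
    if isinstance(photo, dict):
        # An image object: its 'url' may be a single string or a list.
        return _urls_from_photo(photo.get("url", []))
    return []


def collect_photo_urls(node):
    """Recursively gather URLs found under any 'photo' key."""
    urls = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "photo":
                urls.extend(_urls_from_photo(value))
            else:
                urls.extend(collect_photo_urls(value))
    elif isinstance(node, list):
        for item in node:
            urls.extend(collect_photo_urls(item))
    return urls


# Made-up JSON-LD payload, shaped like the one the answer describes.
sample = json.loads("""
{"@graph": [
  {"@type": "WebSite", "name": "example"},
  {"@type": "Residence",
   "photo": {"@type": "ImageObject",
             "url": ["https://example.com/1.jpg", "https://example.com/2.jpg"]}}
]}
""")
photo_urls = collect_photo_urls(sample)
```

With the real page, collect_photo_urls(script_json) would take the place of the indexed lookup, assuming the photo entry keeps the same key names.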

lazy-loading

beautifulsoup

web-scraping