How to open all href within a div class?

Question

I'm new to Python and I want to parse all the hrefs inside a div class. My goal is to write a program that opens every link in that div so I can save the photos associated with each href.

Link: https://www.opi.com/shop-products/nail-polish-powders/nail-lacquer

The part I want to parse is the div with id "all_nail_lacquer".

So far I've been able to get all the hrefs. This is what I have at the moment:

import urllib.request
from bs4 import BeautifulSoup

theurl = "https://www.opi.com/shop-products/nail-polish-powders/nail-lacquer"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")

print(soup.title.text)

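# The single div that wraps the whole product listing
nail_lacquer = soup.find('div', {"id": "all_nail_lacquer"})

"""
for nail_lacquer in soup.find_all('div'):
    print(nail_lacquer.findAll('a'))
"""

# Print the href of every <a> inside the div with id "all_nail_lacquer"
for a in soup.findAll('div', {"id": "all_nail_lacquer"}):
    for b in a.findAll('a'):
        print(b.get('href'))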

Answer 1

To print the image links (including the hi-res images) and the titles, you can use this script:

import urllib.request
from bs4 import BeautifulSoup

theurl = "https://www.opi.com/shop-products/nail-polish-powders/nail-lacquer"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")

for img in soup.select('#all_nail_lacquer [typeof="foaf:Image"][data-src]'):
    print(img['data-src'])
    print(img['data-src'].replace('shelf_image', 'photos')) # <-- this is URL to hi-res image
    print(img['title'])
    print('-' * 80)

Prints:

https://www.opi.com/sites/default/files/styles/product_shelf_image/public/baby-take-a-vow-nlsh1-nail-lacquer-22850011001_0_0.jpg?itok=3b2ftHzc
https://www.opi.com/sites/default/files/styles/product_photos/public/baby-take-a-vow-nlsh1-nail-lacquer-22850011001_0_0.jpg?itok=3b2ftHzc
Baby, Take a Vow
--------------------------------------------------------------------------------
https://www.opi.com/sites/default/files/styles/product_shelf_image/public/suzi-without-a-paddle-nlf88-nail-lacquer-22006698188_21_0.jpg?itok=mgi1-rz3
https://www.opi.com/sites/default/files/styles/product_photos/public/suzi-without-a-paddle-nlf88-nail-lacquer-22006698188_21_0.jpg?itok=mgi1-rz3
Suzi Without a Paddle
--------------------------------------------------------------------------------
https://www.opi.com/sites/default/files/styles/product_shelf_image/public/coconuts-over-opi-nlf89-nail-lacquer-22006698189_24_1_0.jpg?itok=yasOZA4l
https://www.opi.com/sites/default/files/styles/product_photos/public/coconuts-over-opi-nlf89-nail-lacquer-22006698189_24_1_0.jpg?itok=yasOZA4l
Coconuts Over OPI
--------------------------------------------------------------------------------
https://www.opi.com/sites/default/files/styles/product_shelf_image/public/no-tan-lines-nlf90-nail-lacquer-22006698190_20_1_0.jpg?itok=ot_cu8c5
https://www.opi.com/sites/default/files/styles/product_photos/public/no-tan-lines-nlf90-nail-lacquer-22006698190_20_1_0.jpg?itok=ot_cu8c5
No Tan Lines
--------------------------------------------------------------------------------


...and so on.
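The string replacement that turns a shelf-image URL into the hi-res one can also be pulled out into a small helper. This is only a sketch: `hires_url` is a made-up name, and it assumes the site keeps using the `product_shelf_image` / `product_photos` style folders seen in the output above. It rewrites only the path part of the URL, so the `?itok=...` query string is left alone.

```python
from urllib.parse import urlsplit, urlunsplit


def hires_url(shelf_url):
    """Return the hi-res variant of a shelf-image URL (hypothetical helper)."""
    parts = urlsplit(shelf_url)
    # Assumption: the image style is a folder name in the path, as in the
    # sample output. Only the path is rewritten, never the query string.
    new_path = parts.path.replace("product_shelf_image", "product_photos", 1)
    return urlunsplit(parts._replace(path=new_path))


shelf = (
    "https://www.opi.com/sites/default/files/styles/product_shelf_image/public/"
    "baby-take-a-vow-nlsh1-nail-lacquer-22850011001_0_0.jpg?itok=3b2ftHzc"
)
print(hires_url(shelf))
```

For the first image above this should print the same `product_photos` URL that the script prints on its second line. A URL without the `product_shelf_image` folder comes back unchanged.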

Edit: To save the images to disk, you can use this script:

import requests
from bs4 import BeautifulSoup

theurl = "https://www.opi.com/shop-products/nail-polish-powders/nail-lacquer"
thepage = requests.get(theurl)
soup = BeautifulSoup(thepage.content, "html.parser")

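for i, img in enumerate(soup.select('#all_nail_lacquer [typeof="foaf:Image"][data-src]'), start=1):
    # Swap the shelf-image style for the hi-res "photos" style
    u = img['data-src'].replace('shelf_image', 'photos')
    print('Saving {}'.format(u))
    data = requests.get(u).content  # download first, so a failed request leaves no empty file
    with open('img_{:04d}.jpg'.format(i), 'wb') as f_out:
        f_out.write(data)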
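If you would rather keep the original file names than numbered ones, the name can be taken from the image URL itself. The following is a rough sketch, not part of the script above: `filename_from_url` and its `fallback` parameter are hypothetical, and it assumes the last path segment of each URL is a usable file name, which holds for the URLs shown in the output earlier.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, fallback="image.jpg"):
    """Derive a local file name from an image URL (hypothetical helper)."""
    # urlsplit separates the path from the query string (?itok=...),
    # so the query never ends up in the file name.
    path = urlsplit(url).path
    name = posixpath.basename(unquote(path))
    # Fall back to a fixed name if the URL ends in a slash or has no path.
    return name or fallback


u = (
    "https://www.opi.com/sites/default/files/styles/product_photos/public/"
    "suzi-without-a-paddle-nlf88-nail-lacquer-22006698188_21_0.jpg?itok=mgi1-rz3"
)
print(filename_from_url(u))
```

In the download loop you could then pass `filename_from_url(u)` to `open(...)` in place of the numbered `'img_{:04d}.jpg'` pattern. Note that two images with the same base name would overwrite each other, which the numbered pattern avoids.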
