Extracting product URLs from a search query on a website

Question

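For example, suppose I want to track price changes for MIDI keyboards on https://www.gear4music.com/Studio-MIDI-Controllers. I need to extract the URLs of all the products pictured in the search results, then loop through those product URLs and extract the price information for each product. I can get the price data for a single product by hard-coding its URL, but I can't find a way to collect the URLs of multiple products automatically.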

Here is what I have tried so far:

from bs4 import BeautifulSoup
import requests

url = "https://www.gear4music.com/Studio-MIDI-Controllers"

response = requests.get(url)
data = response.text

soup = BeautifulSoup(data, 'lxml')

# Collect every <a> tag on the page and print its href
tags = soup.find_all('a')
for tag in tags:
    print(tag.get('href'))

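This does produce a list of URLs, but I can't tell which of them belong to the MIDI keyboards in the search results whose prices I want. Is there a better, more specific way to get only the product URLs rather than every link in the HTML file?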

Answer 1

There are many ways to get the product links. One is to select all <a> tags that have a data-g4m-inv attribute:

import requests
from bs4 import BeautifulSoup

url = "https://www.gear4music.com/Studio-MIDI-Controllers"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for a in soup.select("a[data-g4m-inv]"):
    print("https://www.gear4music.com" + a["href"])

Prints:

https://www.gear4music.com/Recording-and-Computers/SubZero-MiniPad-MIDI-Controller/P6E
https://www.gear4music.com/Recording-and-Computers/SubZero-MiniControl-MIDI-Controller/P6D
https://www.gear4music.com/Keyboards-and-Pianos/SubZero-MiniKey-25-Key-MIDI-Controller/JMR
https://www.gear4music.com/Keyboards-and-Pianos/Nektar-SE25/2XWA
https://www.gear4music.com/Keyboards-and-Pianos/Korg-nanoKONTROL2-USB-MIDI-Controller-Black/G8L
https://www.gear4music.com/Recording-and-Computers/SubZero-ControlKey25-MIDI-Keyboard/221Y
https://www.gear4music.com/Keyboards-and-Pianos/SubZero-CommandKey25-Universal-MIDI-Controller/221X

...
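
If the same product is linked more than once on the page (for example from both its image and its title), the loop above would print it twice. Here is a rough sketch of one way to handle that, assuming the same data-g4m-inv selector. The helper name extract_product_urls, the SAMPLE_HTML markup, and the attribute values "x" and "y" are made up for illustration so the snippet runs without hitting the site; urljoin is used instead of string concatenation to build absolute URLs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.gear4music.com"


def extract_product_urls(html, base_url=BASE_URL):
    """Return absolute product URLs in page order, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    seen = set()
    # Same attribute selector as above; [href] skips anchors that have no link.
    for a in soup.select("a[data-g4m-inv][href]"):
        url = urljoin(base_url, a["href"])
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Made-up markup that only imitates the listing page (attribute values are placeholders).
SAMPLE_HTML = """
<a href="/Keyboards-and-Pianos/Nektar-SE25/2XWA" data-g4m-inv="x"><img src="a.jpg"></a>
<a href="/Keyboards-and-Pianos/Nektar-SE25/2XWA" data-g4m-inv="x">Nektar SE25</a>
<a href="/Recording-and-Computers/SubZero-MiniPad-MIDI-Controller/P6E" data-g4m-inv="y">SubZero MiniPad</a>
<a href="/basket">Basket</a>
"""

for url in extract_product_urls(SAMPLE_HTML):
    print(url)
```

For the live page you would pass requests.get(url).content in place of SAMPLE_HTML. The returned list can then be looped over to fetch each product page for its price.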

Answer 2

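Open the Chrome developer console and look at the div that corresponds to a product. From there, set a variable equal to soup.find_all(the div mentioned above) and loop through the results to find that element's child tags (or identify the class used for the title and search that way).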
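
A minimal sketch of that idea is below. The class names "product-card" and "product-title" and the sample markup are placeholders, not the site's real markup; you would replace them with whatever the developer console shows for a product div and its title.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: "product-card" and "product-title" are placeholder
# class names, not taken from the real site.
SAMPLE_HTML = """
<div class="product-card">
  <a href="/Keyboards-and-Pianos/Nektar-SE25/2XWA"><h3 class="product-title">Nektar SE25</h3></a>
</div>
<div class="product-card">
  <a href="/Recording-and-Computers/SubZero-MiniPad-MIDI-Controller/P6E"><h3 class="product-title">SubZero MiniPad</h3></a>
</div>
<div class="footer"><a href="/contact">Contact</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

products = []
# Find every product div, then look inside each one for its link and title.
for card in soup.find_all("div", class_="product-card"):
    link = card.find("a", href=True)           # first link inside the product div
    title = card.find(class_="product-title")  # title element, if present
    if link is None:
        continue
    products.append({
        "title": title.get_text(strip=True) if title else None,
        "href": link["href"],
    })

for product in products:
    print(product["title"], product["href"])
```

Scoping the search to the product div is what keeps unrelated links (navigation, footer, basket) out of the result.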


python

beautifulsoup

python-requests