How can I get all the hrefs on this website?

Question

I can usually find all the hrefs, but my script is not scraping anything and I can't figure out why.

Here is my script:

import warnings
warnings.filterwarnings("ignore")

import re
import json
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

url = "https://www.frayssinet-joaillier.fr/fr/marques/longines"
soup = BeautifulSoup(requests.get(url).content, "html.parser")



#get the links

all_title = soup.find_all('a', class_ = 'prod-item__container')

data_titles = []
for title in all_title:
    try:
        product_link = title['href']
        data_titles.append(product_link)
    except:
        pass

print(data_titles)

data = pd.DataFrame({
    'links' : data_titles
    })

data.to_csv("testlink.csv", sep=';', index=False)

Here is the HTML:

It seems that soup.find_all('a', class_ = 'prod-item__container') should work, but it doesn't.

Any idea why?

Answer 1

Send some headers with your request to get the content. Some websites serve a different response depending on the User-Agent in order to prevent crawling or scraping:

headers = {'User-Agent': 'Mozilla/5.0'}
url = "https://www.frayssinet-joaillier.fr/fr/marques/longines"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

Example

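import requests
from bs4 import BeautifulSoup

# a browser-like User-Agent so the site returns the regular page
headers = {'User-Agent': 'Mozilla/5.0'}
url = "https://www.frayssinet-joaillier.fr/fr/marques/longines"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")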

#get the links

all_title = soup.find_all('a', class_ = 'prod-item__container')

data_titles = []
for title in all_title:
    try:
        product_link = title['href']
        data_titles.append(product_link)
    except KeyError:
        # anchor has no href attribute, skip it
        pass

print(data_titles)
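
To confirm that the selector itself is fine and that the empty result comes from the response the server sent, the same lookup can be run against a small piece of static HTML. The sketch below is only an illustration: the markup in SAMPLE_HTML and the helper name extract_links are made up for this example and are not copied from the real page; only the class name prod-item__container comes from the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup: a made-up stand-in for the product grid,
# not copied from the real page.
SAMPLE_HTML = """
<div class="prod-list">
  <a class="prod-item__container" href="/fr/montres/example-1">Watch 1</a>
  <a class="prod-item__container" href="/fr/montres/example-2">Watch 2</a>
  <a class="prod-item__container">No href here</a>
  <a class="other-link" href="/fr/contact">Contact</a>
</div>
"""

BASE_URL = "https://www.frayssinet-joaillier.fr"


def extract_links(html, base_url=BASE_URL):
    """Return absolute URLs of all product anchors that carry an href."""
    soup = BeautifulSoup(html, "html.parser")
    # The [href] attribute selector skips anchors without an href,
    # so no try/except is needed around the lookup.
    anchors = soup.select("a.prod-item__container[href]")
    return [urljoin(base_url, a["href"]) for a in anchors]


links = extract_links(SAMPLE_HTML)
print(len(links), links)
```

If this finds the links in the sample but the live page still yields an empty list, the HTML returned to the script most likely differs from what the browser shows, which is consistent with the User-Agent explanation above.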

Answer 2

To get the data, we need to pass a user-agent to this website. Use the code below.

url = "https://www.frayssinet-joaillier.fr/fr/marques/longines"
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36',
}
soup = BeautifulSoup(requests.get(url, headers=header).content, "html.parser")
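
If more than one page has to be fetched, the header can be set once on a requests.Session instead of being passed to every call. This is a minimal sketch under assumptions: the helper names make_session and fetch_soup are hypothetical, the 10 second timeout is an arbitrary choice, and it is not guaranteed that this particular User-Agent string will keep working for the site.

```python
import requests
from bs4 import BeautifulSoup

# Assumption: a browser-like User-Agent is enough, as in the answers above.
USER_AGENT = "Mozilla/5.0"


def make_session(user_agent=USER_AGENT):
    """Create a session that sends the given User-Agent with every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_soup(session, url, timeout=10):
    """Fetch a page and parse it; raise if the server answers with an error status."""
    response = session.get(url, timeout=timeout)
    # An HTTP error status surfaces here instead of silently
    # producing an empty result list later on.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


session = make_session()
print(session.headers["User-Agent"])
```

It could then be used as soup = fetch_soup(session, url), followed by the same find_all call as in the question.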
