Trying to scrape multiple urls from a csv file, but getting a 404 response for every url except the last one loaded from the csv file.

import requests
from bs4 import BeautifulSoup
import csv
import lxml
with open('xyz/spec.csv') as file:
    reqdata = []
    for line in file:
        headers = {'User-Agent': 'Mozilla/5.0'}
        r = requests.get(line, headers=headers)
        soup = BeautifulSoup(r.text, "lxml")
        need = soup.find_all('span', attrs={"class":"10965hju"})
        needs = []
        for tit in need:
            needs.append(tit.text.strip())
        reqdata.append(needs)
        
    print(reqdata)

Because you are just reading raw lines from the csv file, every url has a trailing newline (\n) at the end, except the last one.

The simplest fix is:

r = requests.get(line.strip(), headers=headers)

strip() with no arguments removes leading and trailing whitespace, including the newline.
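A quick illustration of what the trailing newline does (example.com is a placeholder url):

```python
# A line read from a file keeps its newline, so the request goes
# to ".../page\n" instead of ".../page" and the server returns 404.
raw = "https://example.com/page\n"
print(repr(raw))          # with the newline
print(repr(raw.strip()))  # cleaned url, safe to pass to requests.get
```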

See below.

(If you are still having problems, you will have to share spec.csv.)

import requests
from bs4 import BeautifulSoup

with open('xyz/spec.csv') as file:
    reqdata = []
    headers = {'User-Agent': 'Mozilla/5.0'}
    # strip the trailing newline from each url before requesting it
    for line in [l.strip() for l in file.readlines()]:
        r = requests.get(line, headers=headers)
        if r.status_code == 200:  # only parse successful responses
            soup = BeautifulSoup(r.text, "lxml")
            need = soup.find_all('span', attrs={"class": "10965hju"})
            needs = []
            for tit in need:
                needs.append(tit.text.strip())
            reqdata.append(needs)

print(reqdata)
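Since the question already imports csv, another option is to let csv.reader handle the line terminators for you. A minimal sketch, assuming one url per row in the first column (the io.StringIO data here just stands in for the real spec.csv):

```python
import csv
import io

# csv.reader strips the record terminator itself, so each field comes
# back without a trailing newline (example.com rows are placeholders).
data = "https://example.com/a\nhttps://example.com/b\n"
urls = [row[0] for row in csv.reader(io.StringIO(data)) if row]
print(urls)  # ['https://example.com/a', 'https://example.com/b']
```

With the real file you would use `csv.reader(file)` inside the `with open(...)` block and loop over the resulting urls the same way as above.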