Trying to scrape multiple URLs from a CSV file, but getting a 404 response for every URL except the last one read from the file
import requests
from bs4 import BeautifulSoup
import csv
import lxml

with open('xyz/spec.csv') as file:
    reqdata = []
    for line in file:
        headers = {'User-Agent': 'Mozilla/5.0'}
        r = requests.get(line, headers=headers)
        soup = BeautifulSoup(r.text, "lxml")
        need = soup.find_all('span', attrs={"class": "10965hju"})
        needs = []
        for tit in need:
            needs.append(tit.text.strip())
        reqdata.append(needs)
print(reqdata)
Because you are just reading lines from the file, every one of your URLs ends with a newline character (\n) except the last one, so the requested URL is wrong and the server answers 404.
The simplest fix is

r = requests.get(line.strip(), headers=headers)

strip() without arguments removes leading and trailing whitespace.
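To make the failure visible, here is a minimal sketch, assuming spec.csv holds one URL per line (the file path and the URL in the comments are placeholders):

with open('xyz/spec.csv') as file:
    for line in file:
        # every line except possibly the last keeps its line terminator
        print(repr(line))          # e.g. 'https://example.com/page1\n'
        print(repr(line.strip()))  # 'https://example.com/page1'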
See the full version below. (If you still run into problems, you will have to share spec.csv.)
import requests
from bs4 import BeautifulSoup

with open('xyz/spec.csv') as file:
    reqdata = []
    headers = {'User-Agent': 'Mozilla/5.0'}
    # strip each line so the trailing newline is not sent as part of the URL
    for line in [l.strip() for l in file.readlines()]:
        r = requests.get(line, headers=headers)
        if r.status_code == 200:  # only parse pages that actually loaded
            soup = BeautifulSoup(r.text, "lxml")
            need = soup.find_all('span', attrs={"class": "10965hju"})
            needs = []
            for tit in need:
                needs.append(tit.text.strip())
            reqdata.append(needs)
print(reqdata)
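As an aside, the question imports the csv module but never uses it. If spec.csv really has CSV structure rather than one bare URL per line, a sketch with csv.reader could look like this (that the URL sits in the first column is an assumption):

import csv
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
with open('xyz/spec.csv', newline='') as file:
    for row in csv.reader(file):
        url = row[0]  # assumed: URL in the first column
        # csv.reader handles line endings itself, so no strip() is needed
        r = requests.get(url, headers=headers)
        print(url, r.status_code)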