BeautifulSoup: MissingSchema invalid URL error

I am trying to download images from a web page using BeautifulSoup. I get the following error:

MissingSchema: Invalid URL
import requests
from bs4 import BeautifulSoup
import os
from os.path import basename
from urllib.request import urlopen



url = "https://xxxxxx"

#r = requests.get(url)

request_page = urlopen(url)
page_html = request_page.read()
request_page.close()
soup = BeautifulSoup(page_html, 'html.parser')

#print(soup.title.text)
images = soup.find_all('img')
for image in images:
    name = image['alt']
    link =image['src']
    with open(name.replace(' ', '-').replace('/', '') + 'jpg', 'wb') as f:
        im = requests.get(link)
        f.write(im.content)
    

print(images)

I'm not sure why. I know the images are being read correctly, because the print works fine until I add the following code:

with open(name.replace(' ', '-').replace('/', '') + 'jpg', 'wb') as f:
    im = requests.get(link)
    f.write(im.content)

Any help would be appreciated. Thanks.


Edit


The URL is:

url = "https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/September_2018"

As requested, I added a print of link. The output is as follows:

//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg/300px-Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/c/c5/Titian_-_Portrait_of_a_man_with_a_quilted_sleeve.jpg/280px-Titian_-_Portrait_of_a_man_with_a_quilted_sleeve.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/f/f7/Bee_on_Lavender_Blossom_2.jpg/250px-Bee_on_Lavender_Blossom_2.jpg

Edit


I'm just wondering whether the problem is the length of the name in link? The jpeg seems to be buried under a lot of folders before we get to it.

This should work:

import re
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

site = 'https://books.toscrape.com/'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

# Collect the src attribute of every <img> tag
urls = [img['src'] for img in img_tags]


for url in urls:
    # Use the last path component as the file name; only jpg, gif and png match
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        print("Regex didn't match with the url: {}".format(url))
        continue
    # Resolve relative and protocol-relative (//host/path) links against the site URL
    url = urljoin(site, url)
    response = requests.get(url)
    # Open the file only after the download has returned
    with open(filename.group(1), 'wb') as f:
        f.write(response.content)

As I suspected from the error, once you add that print statement you can see that the link you are trying to access is not a valid URL.

//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg/300px-Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg needs to start with https:.

To fix this, just prepend it to image['src'].
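If you would rather not hard-code the prefix, one possible alternative is urljoin from the standard library, which resolves a protocol-relative src against the URL of the page it came from. This is only a sketch and not part of the fix above; the page URL and the first src are taken from the question, while the /static/logo.png path is a made-up example.

```python
from urllib.parse import urljoin

# Page the <img> tags were scraped from (taken from the question).
page_url = "https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/September_2018"

# A protocol-relative src, as printed in the question.
src = "//upload.wikimedia.org/wikipedia/commons/thumb/f/f7/Bee_on_Lavender_Blossom_2.jpg/250px-Bee_on_Lavender_Blossom_2.jpg"

# urljoin borrows the scheme ("https") from page_url.
link = urljoin(page_url, src)
print(link)

# Hypothetical site-relative path: resolved against the page's host.
logo = urljoin(page_url, "/static/logo.png")
print(logo)
```

A src that is already absolute passes through urljoin unchanged, so the same call can be used for every image on the page.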

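The second problem you need to fix is that when you write the file, you write it as 'Natalya-Naryshkinajpg'. You need jpg as the file extension, for example 'Natalya-Naryshkina.jpg'. I fixed that as well.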
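To make the file name issue concrete, here is a small sketch of building the name from the alt text plus the extension found in the URL. The helper name image_filename and the alt text 'Natalya Naryshkina' are assumptions for illustration; the URL is the first one printed in the question with https: prepended.

```python
import os
from urllib.parse import urlparse

def image_filename(alt_text, link):
    """Build a file name from the alt text plus the extension found in the URL."""
    # Take the extension (including the dot) from the URL path, ignoring any query string.
    extension = os.path.splitext(urlparse(link).path)[1]
    # Same clean-up as in the question: spaces become dashes, slashes are dropped.
    stem = alt_text.replace(' ', '-').replace('/', '')
    return stem + extension

link = ("https://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/"
        "Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg/"
        "300px-Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg")

# Hypothetical alt text, chosen to match the example name above.
print(image_filename('Natalya Naryshkina', link))
```

Because os.path.splitext returns the extension with its leading dot, there is no separate '.' to forget when concatenating.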

Code:

import requests
from bs4 import BeautifulSoup


url = "https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/September_2019"
# Send a browser-like User-Agent header with every request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

r = requests.get(url, headers=headers)
page_html = r.text
soup = BeautifulSoup(page_html, 'html.parser')

#print(soup.title.text)
images = soup.find_all('img')
for image in images:
    name = image['alt']
    # The src values are protocol-relative (//upload.wikimedia.org/...), so prepend the scheme
    link = 'https:' + image['src']
    #print(link)
    # Skip any link that contains 'static'
    if 'static' not in link:
        try:
            # Reuse the extension from the URL (jpg, png, ...)
            extension = link.split('.')[-1]
            with open(name.replace(' ', '-').replace('/', '') + '.' + extension, 'wb') as f:
                im = requests.get(link, headers=headers)
                f.write(im.content)
                print(name)
        except Exception as e:
            print(e)

print(images)