BeautifulSoup: MissingSchema invalid URL error
I am trying to download images from a webpage using BeautifulSoup. I get the following error:
MissingSchema: Invalid URL
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen  # needed for the urlopen call below
import os
from os.path import basename

url = "https://xxxxxx"
#r = requests.get(url)
request_page = urlopen(url)
page_html = request_page.read()
request_page.close()

soup = BeautifulSoup(page_html, 'html.parser')
#print(soup.title.text)

images = soup.find_all('img')
for image in images:
    name = image['alt']
    link = image['src']
    with open(name.replace(' ', '-').replace('/', '') + 'jpg', 'wb') as f:
        im = requests.get(link)
        f.write(im.content)
print(images)
I am not sure why. I know I can read the images fine, because the print worked well until I added the following code:
with open(name.replace(' ', '-').replace('/', '') + 'jpg', 'wb') as f:
    im = requests.get(link)
    f.write(im.content)
Any help would be appreciated. Thanks.
EDIT
The url is:
url = "https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/September_2018"
I added a print of link as requested, and the output is as follows:
//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg/300px-Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/c/c5/Titian_-_Portrait_of_a_man_with_a_quilted_sleeve.jpg/280px-Titian_-_Portrait_of_a_man_with_a_quilted_sleeve.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/f/f7/Bee_on_Lavender_Blossom_2.jpg/250px-Bee_on_Lavender_Blossom_2.jpg
EDIT
I am just wondering whether it is the length of the name in the link? The jpeg seems to be buried under a lot of folders before we get to it?
This should work:
import re
import requests
from bs4 import BeautifulSoup

site = 'https://books.toscrape.com/'
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')

# collect the src attribute of every <img> tag on the page
img_tags = soup.find_all('img')
urls = [img['src'] for img in img_tags]

for url in urls:
    # derive the filename from the last path segment of the url
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        print("Regex didn't match with the url: {}".format(url))
        continue
    with open(filename.group(1), 'wb') as f:
        # resolve relative links against the site root before downloading
        if 'http' not in url:
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
As I suspected from the error, once you add that print statement you can see that the links you are trying to access are not valid URLs:
//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg/300px-Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg
needs to start with https:. To fix this, simply prepend 'https:' to image['src'].
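A more general way to normalize such protocol-relative links is urllib.parse.urljoin from the standard library, which resolves them against the page they were scraped from. A minimal sketch, using one of the printed src values from above:

from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/September_2018"
src = "//upload.wikimedia.org/wikipedia/commons/thumb/f/f7/Bee_on_Lavender_Blossom_2.jpg/250px-Bee_on_Lavender_Blossom_2.jpg"

# urljoin handles protocol-relative (and ordinary relative) src values,
# so the scheme does not have to be prepended by hand
link = urljoin(page_url, src)
print(link)  # https://upload.wikimedia.org/wikipedia/commons/thumb/f/f7/...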
The second problem you need to fix is the filename: as written, you save the file as 'Natalya-Naryshkinajpg'. You need .jpg as the file extension, e.g. 'Natalya-Naryshkina.jpg'. I fixed that as well.
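One way to derive the extension from the link itself instead of hard-coding it is to split the URL path; a small sketch (the alt text here is assumed for illustration):

import os
from urllib.parse import urlparse

link = 'https://upload.wikimedia.org/wikipedia/commons/thumb/f/f7/Bee_on_Lavender_Blossom_2.jpg/250px-Bee_on_Lavender_Blossom_2.jpg'
name = 'Bee on Lavender Blossom'  # assumed alt text, for illustration only

# take the extension from the URL path, falling back to .jpg when absent
ext = os.path.splitext(urlparse(link).path)[1] or '.jpg'
filename = name.replace(' ', '-').replace('/', '') + ext
print(filename)  # Bee-on-Lavender-Blossom.jpg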
Code:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/September_2019"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

r = requests.get(url, headers=headers)
page_html = r.text
soup = BeautifulSoup(page_html, 'html.parser')
#print(soup.title.text)

images = soup.find_all('img')
for image in images:
    name = image['alt']
    # the src values are protocol-relative, so prepend the scheme
    link = 'https:' + image['src']
    #print(link)
    if 'static' not in link:
        try:
            # keep the real extension instead of hard-coding 'jpg'
            extension = link.split('.')[-1]
            with open(name.replace(' ', '-').replace('/', '') + '.' + extension, 'wb') as f:
                im = requests.get(link, headers=headers)
                f.write(im.content)
            print(name)
        except Exception as e:
            print(e)
print(images)
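If you would rather fail loudly on bad links than swallow everything with a broad except, requests raises a specific exception for this case. A minimal sketch of the inner download step, assuming the same link and headers variables as above:

import requests

try:
    im = requests.get(link, headers=headers)
    im.raise_for_status()  # surface HTTP errors such as 404 as exceptions
except requests.exceptions.MissingSchema:
    # this is the original error: the url lacks a scheme like https://
    print('Invalid URL, missing scheme: {}'.format(link))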