Select only certain images with Python image webscraper
I am trying to create a simple web scraper in Python that finds, downloads, and builds a PDF from certain images on a website. So far I have only written the web-scraping part of the code:
import requests
from bs4 import BeautifulSoup
import numpy as np

url = 'website url'
page = requests.get(url)
print('=== website ===\n', url)

soup = BeautifulSoup(page.content, 'html.parser')
images = soup.find_all('img')
print('=== images found ===')
for img in images:
    if img.has_attr('src'):
        print(img['src'])
This is what I get:
=== website ===
https://ita.net/stop-1/
=== images found ===
https://ita.net/wp-content/uploads/2021/09/021-5.jpg
https://ita.net/wp-content/uploads/2021/09/021-5-430x350.jpg
https://ita.net/wp-content/uploads/2021/09/004-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/005-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/006-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/007-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/008-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/009-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/010-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/011-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/012-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/013-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/014-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/015-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/016-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/017-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/018-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/019-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/020-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/021-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/022-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/023-3-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/024-3-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/025-4-722x1024.jpg
https://ita.net/wp-content/uploads/2022/03/ita-sidebar-5.jpg
https://ita.net/wp-content/uploads/2022/03/telegram-1.jpg
https://ita.net/wp-content/uploads/2021/11/ita-logo-w-1-1024x311.png
https://ita.net/wp-content/uploads/2021/11/premium-1024x407.png
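I say "certain" because my code finds and prints every image on the site. However, I only want to show (and select) the images whose URLs end with 722x1024.jpg.
Does anyone know how to do this?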
imgs = []
for img in images:
    if img.has_attr('src'):
        # keep only the URLs that contain the size marker
        if "722x1024.jpg" in img['src']:
            imgs.append(img['src'])
Or:
img_list = soup.find_all(
    lambda tag: tag.name == 'img' and
    'src' in tag.attrs and '722x1024.jpg' in tag.attrs['src'])
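Note that both snippets above test with `in`, which matches the marker anywhere in the URL, while the question asks for URLs that end with it. The sketch below compares the two checks on a few made-up URLs (the `example.com` addresses and the helper name `has_suffix` are assumptions for illustration only). It also strips any query string first, on the assumption that a site might append one such as `?ver=2`.

```python
from urllib.parse import urlsplit


def has_suffix(url, suffix="722x1024.jpg"):
    """Return True if the path part of `url` ends with `suffix`."""
    # urlsplit() separates the path from any query string or fragment
    return urlsplit(url).path.endswith(suffix)


# Hypothetical URLs, not taken from the site in the question
urls = [
    "https://example.com/uploads/004-5-722x1024.jpg",
    "https://example.com/uploads/004-5-722x1024.jpg?ver=2",
    "https://example.com/uploads/722x1024.jpg-backup.png",
]

substring_hits = [u for u in urls if "722x1024.jpg" in u]
suffix_hits = [u for u in urls if has_suffix(u)]

print(len(substring_hits))  # the substring test accepts all three
print(suffix_hits)          # the suffix test keeps only the first two
```

For the output shown in the question the two checks select the same images, so the difference only matters if a URL contains the marker somewhere other than at the end.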
First: you can use {'src': True} to get only the images that have a src attribute.
Because src is a string, you can use any string method on it, e.g. .endswith():
images = soup.find_all('img', {'src': True})
for img in images:
    if img['src'].endswith('722x1024.jpg'):
        print(img['src'])
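Since `src` is an ordinary Python string, `str.endswith()` can also take a tuple of suffixes, which may be handy if the same size comes in more than one file type. The following is a minimal sketch on a plain list of made-up `src` values (the list and the `.png` variant are assumptions, not something the site in the question is known to serve). It lowercases each value first so that `.JPG` would match too.

```python
# Hypothetical src values standing in for img['src']
srcs = [
    "uploads/004-5-722x1024.jpg",
    "uploads/005-5-722X1024.JPG",
    "uploads/ita-logo-w-1-1024x311.png",
    "uploads/006-4-722x1024.png",
]

# endswith() accepts a tuple and returns True if any suffix matches
wanted = ("722x1024.jpg", "722x1024.png")

selected = [s for s in srcs if s.lower().endswith(wanted)]
print(selected)
```

Here `selected` keeps the first, second, and fourth entries and drops the logo.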
BeautifulSoup also allows a function to be used as the filter in find_all():
def check(src):
    return (src is not None) and src.endswith('722x1024.jpg')

images = soup.find_all('img', {'src': check})
for img in images:
    print(img['src'])
Or with a lambda:
images = soup.find_all('img', {'src': lambda x: (x is not None) and x.endswith('722x1024.jpg')})
for img in images:
    print(img['src'])
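One more option, not part of the answer above, is a CSS attribute selector: `[src$="..."]` means "the attribute ends with". This sketch assumes a BeautifulSoup version whose `select()` supports that selector (4.7 or later, which uses the soupsieve package) and runs on a small inline HTML snippet with made-up URLs instead of the real page.

```python
from bs4 import BeautifulSoup

# Inline HTML with hypothetical URLs, so no network request is needed
html = """
<div>
  <img src="https://example.com/uploads/004-5-722x1024.jpg">
  <img src="https://example.com/uploads/021-5-430x350.jpg">
  <img alt="no src here">
  <img src="https://example.com/uploads/005-5-722x1024.jpg">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# img[src$="..."] selects <img> tags whose src ends with the given text;
# tags without a src attribute are skipped automatically
matches = [img["src"] for img in soup.select('img[src$="722x1024.jpg"]')]
print(matches)
```

This does the attribute check and the suffix check in one expression, at the cost of being less flexible than a Python function.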
You can also use a regex:
import re

# the dot is escaped so that it matches a literal "."
images = soup.find_all('img', {'src': re.compile(r'722x1024\.jpg$')})
for img in images:
    print(img['src'])
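A detail worth keeping in mind with the regex variant: an unescaped `.` matches any character, so a pattern like `722x1024.jpg$` is slightly looser than a literal suffix. The sketch below uses only the standard `re` module on made-up strings to show the difference. It calls `search()`, which, as far as I know, is also how BeautifulSoup applies a compiled pattern to an attribute value.

```python
import re

loose = re.compile('722x1024.jpg$')     # "." matches any character
strict = re.compile(r'722x1024\.jpg$')  # "\." matches only a literal dot

# Hypothetical values for illustration
candidates = [
    "img/004-5-722x1024.jpg",
    "img/004-5-722x1024-jpg",
    "img/004-5-722x1024.jpg?x=1",
]

loose_hits = [c for c in candidates if loose.search(c)]
strict_hits = [c for c in candidates if strict.search(c)]

print(loose_hits)   # first two entries
print(strict_hits)  # first entry only
```

For ordinary image URLs the two patterns select the same files, so escaping the dot is mainly a matter of being exact.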
Minimal working example.
I search for images ending in 0.jpg on books.toscrape.com, a site created (by the authors of the scrapy module) specifically for learning scraping.
(See also toscrape.com)
import re

import requests
from bs4 import BeautifulSoup

url = 'https://books.toscrape.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

print('--- version 1 ---')
# get every <img> that has a src, then test the string
images = soup.find_all('img', {'src': True})
for img in images:
    if img['src'].endswith('0.jpg'):
        print(img['src'])

print('--- version 2 a ---')
# pass a named function as the attribute filter
def check(src):
    return (src is not None) and src.endswith('0.jpg')

images = soup.find_all('img', {'src': check})
for img in images:
    print(img['src'])

print('--- version 2 b ---')
# same filter written as a lambda
images = soup.find_all('img', {'src': lambda x: (x is not None) and x.endswith('0.jpg')})
for img in images:
    print(img['src'])

print('--- version 3 ---')
# regex filter; the dot is escaped to match a literal "."
images = soup.find_all('img', {'src': re.compile(r'0\.jpg$')})
for img in images:
    print(img['src'])
Result:
--- version 1 ---
media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg
media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg
--- version 2 a ---
media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg
media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg
--- version 2 b ---
media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg
media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg
--- version 3 ---
media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg
media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg
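The `src` values in this result are relative paths, so they would need to be turned into full URLs before the images can be downloaded. A minimal sketch of that step with `urllib.parse.urljoin` from the standard library follows. The base URL and the two paths are the ones shown above; the variable names are only illustrative.

```python
from urllib.parse import urljoin

base_url = 'https://books.toscrape.com/'

# The two relative paths from the result above
relative_srcs = [
    'media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg',
    'media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg',
]

# urljoin() resolves each path against the page URL;
# a src that is already absolute is returned unchanged
absolute_urls = [urljoin(base_url, src) for src in relative_srcs]

for u in absolute_urls:
    print(u)
```

Because `urljoin()` leaves absolute URLs untouched, the same line should also be safe for pages like the one in the question, where `src` already contains the full address.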