BeautifulSoup img src gets base64 instead of actual link
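I'm new to web scraping with bs4, and I want to get protein images from the Protein Data Bank (PDB):
https://www.rcsb.org/structure/1A69
When I inspect the HTML with the Chrome Inspector, I can see that the image is fetched from an http link, which I can easily open and save the image from.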
<img class="img-responsive center-block mainImage"
src="https://cdn.rcsb.org/images/rutgers/a6/1a69/1a69.pdb1-500.jpg">
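However, when I run my script to extract the src, I only get it as base64.

Am I doing something wrong? What is going on? Is there a way to get from the base64 back to the http link?
My code: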
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen

url = "https://www.rcsb.org/structure/1A69"
resp = urlopen(url)
page = bs(resp, "html.parser")

# Print the src of every <img> that carries the img-responsive class.
for img in page.find_all('img', {'class': 'img-responsive'}):
    src = img['src']
    print(src)
The image URL is assembled dynamically by JavaScript, but you can reproduce that assembly with this Python script:
import requests
from bs4 import BeautifulSoup

url = 'https://www.rcsb.org/structure/1A69'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

pdb_id = url.split('/')[-1].lower()
images_location = "https://cdn.rcsb.org/images/rutgers/"
# One gallery item per image in the structure carousel.
num_items = len(soup.select('#carousel-structuregallery .item'))
# The CDN directory is named after the middle two characters of the ID.
pdb_hash = pdb_id[1:3]

# print image urls to screen:
for i in range(num_items):
    # 0 = Asymmetric; 1+ = Biological Assembly
    if i == 0:
        img_url = images_location + pdb_hash + '/' + pdb_id + '/' + pdb_id + '.pdb-500.jpg'
    else:
        img_url = images_location + pdb_hash + '/' + pdb_id + '/' + pdb_id + '.pdb' + str(i) + '-500.jpg'
    print(img_url)
Prints:
https://cdn.rcsb.org/images/rutgers/a6/1a69/1a69.pdb-500.jpg
https://cdn.rcsb.org/images/rutgers/a6/1a69/1a69.pdb1-500.jpg
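The URL pattern itself does not need the page at all, so it can be pulled out into a small helper. This is only a sketch of the same pattern used in the script above; the function name `build_image_urls` and its parameters are made up for illustration, and the pattern is whatever the site used at the time, so it may change.

```python
def build_image_urls(pdb_id, num_items, base="https://cdn.rcsb.org/images/rutgers/"):
    """Rebuild gallery image URLs for one PDB entry.

    Hypothetical helper: the pattern is copied from the script above,
    not from any documented RCSB API.
    """
    pdb_id = pdb_id.lower()
    pdb_hash = pdb_id[1:3]  # middle two characters, e.g. "a6" for "1a69"
    urls = []
    for i in range(num_items):
        # Index 0 is the asymmetric unit; 1 and up are biological assemblies.
        suffix = ".pdb-500.jpg" if i == 0 else f".pdb{i}-500.jpg"
        urls.append(f"{base}{pdb_hash}/{pdb_id}/{pdb_id}{suffix}")
    return urls


if __name__ == "__main__":
    for image_url in build_image_urls("1A69", 2):
        print(image_url)
```

You still need the number of gallery items from the page (the `num_items` value above) to know how many assemblies to ask for.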
If you want to see the non-base64 images, try this:
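import requests
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen

url = "https://www.rcsb.org/structure/1A69"
resp = urlopen(url)
soup = bs(resp, 'html.parser')

# Collect every src that is present and is not an inline data: URI.
images = []
for img in soup.find_all('img'):
    src = img.get('src')
    if src and not src.startswith('data:'):
        images.append(src)

for i in images:
    # Assumes the remaining src values are protocol-relative ("//host/path").
    i = 'http:' + i
    try:
        response = requests.get(i)
        if response.status_code == 200:
            print(i)
    except requests.RequestException:
        continue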
Output:
http://files.rcsb.org/pub/pdb/validation_reports/a6/1a69/1a69_multipercentile_validation.png
http://cdn.rcsb.org/rcsb-pdb/explorer/SSPv2/images/MendeleyIcon.png
http://cdn.rcsb.org/rcsb-pdb/explorer/SSPv2/images/EndNoteIcon.png
http://cdn.rcsb.org/images/ccd/unlabeled/F/FMB.svg
http://cdn.rcsb.org/images/ccd/unlabeled/S/SO4.svg
http://files.rcsb.org/pub/pdb/validation_reports/a6/1a69/1a69_multipercentile_validation.png
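Instead of prefixing 'http:' by hand, you could let `urllib.parse.urljoin` resolve each src against the page URL and drop the inline `data:` ones up front. A rough sketch, assuming the src values on the page are either `data:` URIs, protocol-relative (`//host/path`), or ordinary relative/absolute links; `usable_image_url` is just a name picked for this example.

```python
from urllib.parse import urljoin


def usable_image_url(src, page_url):
    """Return an absolute URL for src, or None if there is nothing to fetch.

    Hypothetical helper: skips missing src attributes and inline data: URIs,
    and resolves everything else relative to the page it came from.
    """
    if not src or src.startswith("data:"):
        return None
    # urljoin fills in the scheme for "//host/path" and the host for "/path".
    return urljoin(page_url, src)


if __name__ == "__main__":
    page = "https://www.rcsb.org/structure/1A69"
    print(usable_image_url("//cdn.rcsb.org/images/ccd/unlabeled/F/FMB.svg", page))
    print(usable_image_url("data:image/png;base64,AAAA", page))
```

This keeps the scheme of the page (https here) rather than forcing http, and it never raises on a missing src.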