Extracting a long attribute value with BeautifulSoup
OK, I need to parse a website. Can you help me parse this odd piece of markup?
<div class="cloudzoom-gallery e-item-card-photos-small_item" data-cloudzoom="
useZoom:"#item_card_zoom", image:"/upload/66/66ef9b3de11aeaba1bc50a42a1c8b880.jpg", zoomImage:"/upload/66/66ef9b3de11aeaba1bc50a42a1c8b880.jpg""> <img width="44" src="/upload/66/32x44/66ef9b3de11aeaba1bc50a42a1c8b880_32x44.jpg" title="Product1" alt="LGTV"></div>
I only need this div: the information about the image, specifically the image link. How can I do that?
from bs4 import BeautifulSoup
x = '''
<div class="cloudzoom-gallery e-item-card-photos-small_item" data-cloudzoom="
useZoom:"#item_card_zoom", image:"/upload/66/66ef9b3de11aeaba1bc50a42a1c8b880.jpg", zoomImage:"/upload/66/66ef9b3de11aeaba1bc50a42a1c8b880.jpg""> <img width="44" src="/upload/66/32x44/66ef9b3de11aeaba1bc50a42a1c8b880_32x44.jpg" title="Product1" alt="LGTV"></div> '''
soup = BeautifulSoup(x, 'html5lib')
div = soup.find('div', attrs = {'class':'cloudzoom-gallery e-item-card-photos-small_item'})
print(div.img['src'])
print(div.img['title'])
print(div.img['alt'])
This prints the image URL, title, and alt value:
/upload/66/32x44/66ef9b3de11aeaba1bc50a42a1c8b880_32x44.jpg
Product1
LGTV
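Following up on your comment: since you want the larger image, there is a recognizable pattern you can use to get it:
1) The file name is the same, 66ef9b3de11aeaba1bc50a42a1c8b880, except that the smaller image has an underscore and the size appended.
2) The folder name is the first two characters of the file name, 66 in this case.
3) The rest of the file path is the same; the smaller image just has an extra size folder, such as 32x44, in the middle.
Based on these, we can easily build the path to the larger image, for example: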
from bs4 import BeautifulSoup
x = '''
<div class="cloudzoom-gallery e-item-card-photos-small_item" data-cloudzoom="
useZoom:"#item_card_zoom", image:"/upload/66/66ef9b3de11aeaba1bc50a42a1c8b880.jpg", zoomImage:"/upload/66/66ef9b3de11aeaba1bc50a42a1c8b880.jpg""> <img width="44" src="/upload/66/32x44/66ef9b3de11aeaba1bc50a42a1c8b880_32x44.jpg" title="Product1" alt="LGTV"></div> '''
soup = BeautifulSoup(x, 'html5lib')
div = soup.find('div', attrs = {'class':'cloudzoom-gallery e-item-card-photos-small_item'})
file_name = div.img['src'].split("_")[0].split("/")[-1]
extension = div.img['src'].split(".")[-1]
folder_name = file_name[0:2]
final_file_path = "/upload/" + folder_name + "/" + file_name + "." + extension
print(final_file_path)
This prints:
/upload/66/66ef9b3de11aeaba1bc50a42a1c8b880.jpg
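The three rules above can also be wrapped in a small helper that works on the thumbnail path alone. This is only a sketch: the function name large_image_path is made up for illustration, and it assumes that every thumbnail follows the /upload/<first two characters>/<size>/<name>_<size>.<ext> layout seen in this one example.

```python
import posixpath
import re


def large_image_path(thumb_src):
    """Build the full-size image path from a thumbnail path (assumed layout)."""
    base = posixpath.basename(thumb_src)      # '<name>_<width>x<height>.<ext>'
    stem, ext = posixpath.splitext(base)      # split off the extension
    # Rule 1: drop the trailing '_<width>x<height>' suffix, if there is one.
    stem = re.sub(r'_\d+x\d+$', '', stem)
    # Rules 2 and 3: the folder is the first two characters, with no size folder.
    return posixpath.join('/upload', stem[:2], stem + ext)


thumb = '/upload/66/32x44/66ef9b3de11aeaba1bc50a42a1c8b880_32x44.jpg'
print(large_image_path(thumb))
```

Anchoring the suffix pattern to the end of the name means an underscore elsewhere in a file name would not cut it short, which the split("_")[0] approach would do.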
Another, simpler option is to take the div string itself and split it appropriately, like this:
print(x.split("image:")[1].split(",")[0])
This prints the image URL, still wrapped in its double quotes:
"/upload/66/66ef9b3de11aeaba1bc50a42a1c8b880.jpg"
Beautiful Soup provides a way to read data attributes, like this:
div.attrs['data-cloudzoom']
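However, because the data attribute here contains unescaped double quotes inside a double-quoted value, Beautiful Soup does not work here: the parser treats the first inner quote as the end of the value. You can also see that, for the same reason, the HTML you posted above does not get proper syntax highlighting on Stack Overflow.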
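Since the attribute cannot be read reliably through the parser, two workarounds on the raw markup are sketched below using only the standard re module. Both are assumptions rather than anything Beautiful Soup offers: parse_cloudzoom_options and requote_cloudzoom are hypothetical helper names, the option values are assumed to contain no escaped or single quotes, and the data-cloudzoom attribute is assumed to be the last one in its tag.

```python
import re

raw = '''
<div class="cloudzoom-gallery e-item-card-photos-small_item" data-cloudzoom="
useZoom:"#item_card_zoom", image:"/upload/66/66ef9b3de11aeaba1bc50a42a1c8b880.jpg", zoomImage:"/upload/66/66ef9b3de11aeaba1bc50a42a1c8b880.jpg""> <img width="44" src="/upload/66/32x44/66ef9b3de11aeaba1bc50a42a1c8b880_32x44.jpg" title="Product1" alt="LGTV"></div> '''


def parse_cloudzoom_options(html):
    """Collect every key:"value" pair found in the raw markup into a dict."""
    # Ordinary attributes use '=' rather than ':', so they are not matched.
    return dict(re.findall(r'(\w+):"([^"]*)"', html))


def requote_cloudzoom(html):
    """Wrap the data-cloudzoom value in single quotes so the inner
    double quotes no longer end the attribute early."""
    return re.sub(r'data-cloudzoom="(.*?)"\s*>',
                  r"data-cloudzoom='\1'>",
                  html, flags=re.DOTALL)


options = parse_cloudzoom_options(raw)
print(options['image'])

fixed = requote_cloudzoom(raw)
```

The first helper returns the URL without the surrounding quotes. The second only repairs the string; passing fixed to BeautifulSoup should then make div.attrs['data-cloudzoom'] return the whole value, but that is worth checking against the parser you actually use.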