Select the first <li> in each <div> when every <li> has the same class name
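I'm trying to learn Python and BeautifulSoup. As a personal project, I'm scraping a recipe website and displaying certain items in a template.
The site shows each meal's preparation time, calories, and number of servings as li elements inside a div.
A grid on the site contains 35 of these divs. I only want to select the preparation time from each div and store it in a list. All the li elements have the same class and no other attributes. How can I select only the li I need?
Below is the page's HTML. There are 35 such divs, each with a different recipe.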
<div class="column xxlarge-4 large-6 small-12 ">
<a role="link" aria-label="Recept: 'Tiramisu' met advocaat" data-testhook="recipe-card" title="Recept: 'Tiramisu' met advocaat" href="/allerhande/recept/R-R1196417/tiramisu-met-advocaat" class="display-card_root__o17AY card_root__VNG0M card_roundCorners__dYaFu display-card_anchor__cTFon" data-analytics="LINK_CLICK" data-analytics-meta="%7B%22component%22%3A%22recipe-search%22%2C%22href%22%3A%22%2Fallerhande%2Frecept%2FR-R1196417%2Ftiramisu-met-advocaat%22%2C%22title%22%3A%22R-R1196417%22%7D">
<div class="display-card-section_section__42C0n display-card-body_body__r2mt4 card-body_root__E16CU">
<div class="ratio-box_root__YH5Fe ratio-box_ratio-21-10__thBP0">
<div class="ratio-box_content__k-Jz7">
<img class="card-image-set_imageSet__Su7xI lazyautosizes ls-is-cached lazyloaded" alt="'Tiramisu' met advocaat" data-srcset=", https://static.ah.nl/static/recepten/img_RAM_PRD163172_220x162_JPG.jpg 220w 162h, >
</div>
</div>
</div>
<footer class="display-card-section_section__42C0n display-card-section_padded__lHvvK display-card-footer_footer__cxMve card-footer_root__0dl7R">
<ul class="recipe-card-properties_root__rFiwt recipe-card-properties_allerhande__0gSBC" data-testhook="recipe-card-properties">
<li class="recipe-card-properties_property__87cH1">
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="allerhande-icon recipe-card-properties_icon__wBmG9 svg svg--svg_time" viewBox="0 0 24 24" width="24" height="16">
<use xlink:href="#svg_time">
</use>
</svg>
20 min
</li>
<li class="recipe-card-properties_property__87cH1">
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="allerhande-icon recipe-card-properties_icon__wBmG9 svg svg--svg_calories" viewBox="0 0 24 24" width="24" height="16">
<use xlink:href="#svg_calories">
</use>
</svg>
545 kcal
</li>
<li class="recipe-card-properties_property__87cH1">
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="allerhande-icon recipe-card-properties_icon__wBmG9 svg svg--svg_person" viewBox="0 0 24 24" width="24" height="16">
<use xlink:href="#svg_person">
</use>
</svg>
8</li>
</ul>
<p class="typography_root__Om3Wh typography_variant-paragraph__T5ZAU typography_hasMargin__4EaQi card-text_title__REC-7">
<span class="line-clamp_root__7DevG line-clamp_active__5Qc2L card-text_titleText__7T9sY card-text_boldTitle__SVYw2" data-testhook="recipe-card-title" style="-webkit-line-clamp: 2; line-height: 1.2em; max-height: 2.4em;">
'Tiramisu' met advocaat
</span>
</p>
</footer>
</a>
</div>
This is the code I use to extract the information I need:
import re
import requests
from bs4 import BeautifulSoup

# Create soup
webpage_response = requests.get("https://www.ah.nl/allerhande/recepten-zoeken?sortBy=TRENDING")
webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")
recipe_links = soup.find_all('a', attrs={'class' : re.compile('^display-card_root__.*')})
recipe_pictures = soup.find_all('img', attrs={'class' : re.compile('^card-image-set_imageSet__.*')})
recipe_prep_time = soup.find_all('li', attrs={'class' : re.compile('^recipe-card-properties_property__.*')})
However, this selects every li item, including calories and so on. How can I select only the first li in each list?
You can match on one of the values in the icon's multi-valued class attribute, then use next_sibling to move to the text you want:
from bs4 import BeautifulSoup
import requests

webpage_response = requests.get("https://www.ah.nl/allerhande/recepten-zoeken?sortBy=TRENDING")
webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

# Visit every result column that contains a time icon
for recipe in soup.select('[data-testhook="search-page-results"] .column:has(.svg--svg_time)'):
    # The text node right after the icon holds the preparation time
    print(recipe.select_one('.svg--svg_time').next_sibling)
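The same idea can be tried offline. The following is a minimal sketch, not the answerer's code: `SAMPLE_HTML` is a hypothetical, trimmed-down copy of two recipe cards based on the markup in the question, and `prep_times_from_icons` is an illustrative helper name. It assumes the time icon always carries the `svg--svg_time` class and that the preparation time is the text node directly after it.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two trimmed recipe cards modelled on the question's HTML.
SAMPLE_HTML = """
<ul class="recipe-card-properties_root__rFiwt">
  <li class="recipe-card-properties_property__87cH1"><svg class="svg svg--svg_time"><use></use></svg>
    20 min
  </li>
  <li class="recipe-card-properties_property__87cH1"><svg class="svg svg--svg_calories"><use></use></svg>
    545 kcal
  </li>
</ul>
<ul class="recipe-card-properties_root__rFiwt">
  <li class="recipe-card-properties_property__87cH1"><svg class="svg svg--svg_time"><use></use></svg>
    35 min
  </li>
  <li class="recipe-card-properties_property__87cH1"><svg class="svg svg--svg_person"><use></use></svg>
    4</li>
</ul>
"""


def prep_times_from_icons(html):
    """Return the text that follows each time icon, stripped of whitespace."""
    soup = BeautifulSoup(html, "html.parser")
    times = []
    for icon in soup.select("svg.svg--svg_time"):
        text_node = icon.next_sibling
        # Skip icons that are not followed by anything
        if text_node is not None:
            times.append(str(text_node).strip())
    return times


print(prep_times_from_icons(SAMPLE_HTML))
```

Keying on the icon class rather than the position of the li keeps working if the site reorders the properties, at the cost of depending on the icon's class name.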
A simple, direct solution:
# Take the first li inside each properties list (needs `import re` and the soup from the question)
recipe_prep_time = [ul.find('li').text
                    for ul in soup.find_all('ul',
                        attrs={'class': re.compile('^recipe-card-properties_root')})]
This yields:
['15 min',
'15 min',
'20 min',
'20 min',
'35 min',
'20 min',
'20 min',
'10 min',
...]
You can use a CSS selector, as follows:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.ah.nl/allerhande/recepten-zoeken?sortBy=TRENDING')
soup = bs(r.content, "html.parser")

# :nth-child(1) keeps only the li that is the first child of its parent
for li in soup.select('li[class="recipe-card-properties_property__87cH1"]:nth-child(1)'):
    print(li.text)
Output:
15 min
15 min
20 min
20 min
35 min
20 min
20 min
10 min
45 min
50 min
15 min
10 min
15 min
25 min
30 min
25 min
15 min
25 min
20 min
15 min
25 min
20 min
25 min
10 min
15 min
15 min
40 min
15 min
15 min
15 min
25 min
55 min
25 min
15 min
7 min
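The class name in the selector above ends in a generated-looking suffix (`__87cH1`) that may change when the site is rebuilt. As a hedged variation, the sketch below anchors on the `data-testhook="recipe-card-properties"` attribute that appears on the ul in the question's HTML and takes the first li child of each list. `SAMPLE_HTML` is a hypothetical, trimmed-down sample and `first_property_texts` is an illustrative helper name; whether the attribute is more stable than the class is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two trimmed recipe cards modelled on the question's HTML.
SAMPLE_HTML = """
<ul class="recipe-card-properties_root__rFiwt" data-testhook="recipe-card-properties">
  <li class="recipe-card-properties_property__87cH1"><svg class="svg svg--svg_time"></svg>
    20 min
  </li>
  <li class="recipe-card-properties_property__87cH1"><svg class="svg svg--svg_calories"></svg>
    545 kcal
  </li>
</ul>
<ul class="recipe-card-properties_root__rFiwt" data-testhook="recipe-card-properties">
  <li class="recipe-card-properties_property__87cH1"><svg class="svg svg--svg_time"></svg>
    35 min
  </li>
</ul>
"""


def first_property_texts(html):
    """Return the stripped text of the first li in each properties list."""
    soup = BeautifulSoup(html, "html.parser")
    # "> li:first-child" matches an li only if it is the first element child of the ul
    selector = 'ul[data-testhook="recipe-card-properties"] > li:first-child'
    return [li.get_text(strip=True) for li in soup.select(selector)]


print(first_property_texts(SAMPLE_HTML))
```

Like the `:nth-child(1)` version, this relies on the preparation time always being the first property in the list.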