How to get the first level of related tags with BeautifulSoup, XPath or CSS selectors
<main>
  <span>
    <div id="1" class="infocard-list">
      <span>
        <div id="3" class="infocard-list">
        </div>
      </span>
      <span>
        <div id="4" class="infocard-list">
        </div>
      </span>
    </div>
    <div id="2" class="infocard-list">
      <span>
        <div id="5" class="infocard-list">
        </div>
      </span>
      <span>
        <div id="6" class="infocard-list">
        </div>
      </span>
    </div>
  </span>
</main>
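I'm working on a Scrapy project. I want to get all first-level div.infocard-list elements, then from each of those get its own first-level div.infocard-list elements, and so on.
Something like this: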
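def parse(content):
    depth_divs = []
    # "get_layer_divs" is a placeholder for the selector I don't know how to write
    divs = content.xpath("get_layer_divs")
    if divs:
        for div in divs:
            depth_divs.append(div.attrib["id"])
            # recurse into this div to collect the next layer
            next_layer_depth_list = parse(div)
            if next_layer_depth_list:
                depth_divs.append(next_layer_depth_list)
    return depth_divs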
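The function above should return: ["1", ["3", "4"], "2", ["5", "6"]]
I tried the CSS selector content.css(" > div.infocard-list"), but it raises a syntax error because there is no tag before the ">", and I can't supply one because I'm working with this specific HTML.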
Try this BeautifulSoup approach:
from bs4 import BeautifulSoup

html_doc = """
<main>
  <span>
    <div id="1" class="infocard-list">
      <span>
        <div id="3" class="infocard-list">
        </div>
      </span>
      <span>
        <div id="4" class="infocard-list">
        </div>
      </span>
    </div>
    <div id="2" class="infocard-list">
      <span>
        <div id="5" class="infocard-list">
        </div>
      </span>
      <span>
        <div id="6" class="infocard-list">
        </div>
      </span>
    </div>
  </span>
</main>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def get_tree(node, seen):
    out = []
    # find_all returns every matching descendant in document order,
    # so nested divs show up again after their parent
    for d in node.find_all("div", class_="infocard-list"):
        # skip divs already collected by a deeper recursive call
        if d not in seen:
            seen.add(d)
            out.append(d["id"])
            # collect the layer nested inside this div
            t = get_tree(d, seen)
            if t:
                out.append(t)
    return out


print(get_tree(soup, set()))
Prints:
['1', ['3', '4'], '2', ['5', '6']]
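A possible variation, in case sharing a mutable set between calls feels fragile: decide membership in a layer by checking each div's nearest div.infocard-list ancestor with find_parent. This is only a sketch; the extra parent parameter is my own addition, and it compares tags by identity, which sidesteps any question of how tags hash or compare.

```python
from bs4 import BeautifulSoup

HTML = """
<main><span>
  <div id="1" class="infocard-list">
    <span><div id="3" class="infocard-list"></div></span>
    <span><div id="4" class="infocard-list"></div></span>
  </div>
  <div id="2" class="infocard-list">
    <span><div id="5" class="infocard-list"></div></span>
    <span><div id="6" class="infocard-list"></div></span>
  </div>
</span></main>
"""


def get_tree(node, parent=None):
    """Nested list of ids; parent is the infocard-list div being expanded."""
    out = []
    for d in node.find_all("div", class_="infocard-list"):
        # Keep d only when its nearest infocard-list ancestor is the div
        # we are expanding (None at the top level of the document).
        if d.find_parent("div", class_="infocard-list") is not parent:
            continue
        out.append(d["id"])
        nested = get_tree(d, d)  # d becomes the parent for the next layer
        if nested:
            out.append(nested)
    return out


soup = BeautifulSoup(HTML, "html.parser")
tree = get_tree(soup)
print(tree)
```

The trade-off is an extra ancestor walk per candidate div, which should not matter for documents of this size.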