How to get the first-level matching tags with BeautifulSoup, XPath, or CSS selectors

<main>
   <span>
     <div id="1" class="infocard-list">
       <span>
         <div id="3" class="infocard-list">
         </div>
       </span>
       <span>
         <div id="4" class="infocard-list">
         </div>
       </span>
    </div>
    <div id="2" class="infocard-list">
       <span>
         <div id="5" class="infocard-list">
         </div>
       </span>
       <span>
         <div id="6" class="infocard-list">
         </div>
       </span>
    </div>
  </span>
</main>

I'm working on a Scrapy project. I want to get all first-level div.infocard-list elements, then from each of those get its own first-level div.infocard-list elements, and so on.

Something like this:

def parse(content):
    depth_divs = []
    # "get_layer_divs" is a placeholder for the selector I am missing
    divs = content.xpath("get_layer_divs")
    for div in divs:
        depth_divs.append(div.attrib["id"])
        # recurse into this div to collect the next layer
        next_layer_depth_list = parse(div)
        if next_layer_depth_list:
            depth_divs.append(next_layer_depth_list)
    return depth_divs

The function above should return something like this: ["1", ["3", "4"], "2", ["5", "6"]]

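I tried the CSS selector content.css(" > div.infocard-list"), but it raises a syntax error because there is no tag before the ">", and I can't supply one because I'm working with a specific piece of HTML.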

Try:

from bs4 import BeautifulSoup

html_doc = """
<main>
   <span>
     <div id="1" class="infocard-list">
       <span>
         <div id="3" class="infocard-list">
         </div>
       </span>
       <span>
         <div id="4" class="infocard-list">
         </div>
       </span>
    </div>
    <div id="2" class="infocard-list">
       <span>
         <div id="5" class="infocard-list">
         </div>
       </span>
       <span>
         <div id="6" class="infocard-list">
         </div>
       </span>
    </div>
  </span>
</main>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def get_tree(soup, seen):
    out = []
    for d in soup.find_all("div", class_="infocard-list"):
        if d not in seen:
            seen.add(d)
            out.append(d["id"])
            t = get_tree(d, seen)
            if t:
                out.append(t)
    return out


print(get_tree(soup, set()))

Prints:

['1', ['3', '4'], '2', ['5', '6']]
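
A possible variation, sketched here under assumptions rather than offered as the definitive approach: as far as I know, BeautifulSoup compares and hashes tags by their name, attributes, and contents, so a set of seen tags could treat two divs with identical markup as the same element. The version below avoids the set by walking each match's ancestors and comparing by identity. The helper names is_card, is_first_layer, and get_tree_by_parent are hypothetical.

```python
from bs4 import BeautifulSoup

html_doc = """
<main>
  <span>
    <div id="1" class="infocard-list">
      <span><div id="3" class="infocard-list"></div></span>
      <span><div id="4" class="infocard-list"></div></span>
    </div>
    <div id="2" class="infocard-list">
      <span><div id="5" class="infocard-list"></div></span>
      <span><div id="6" class="infocard-list"></div></span>
    </div>
  </span>
</main>
"""


def is_card(tag):
    # True for <div> elements whose class list contains "infocard-list"
    return tag.name == "div" and "infocard-list" in tag.get("class", [])


def is_first_layer(d, node):
    # Walk upward from d: reaching `node` before any other card means
    # d belongs to the first layer below `node`.
    for parent in d.parents:
        if parent is node:
            return True
        if is_card(parent):
            return False
    return False


def get_tree_by_parent(node):
    out = []
    for d in node.find_all("div", class_="infocard-list"):
        if not is_first_layer(d, node):
            continue
        out.append(d["id"])
        sub = get_tree_by_parent(d)
        if sub:
            out.append(sub)
    return out


soup = BeautifulSoup(html_doc, "html.parser")
print(get_tree_by_parent(soup))
```

For the sample document this should give the same nested list as above. The trade-off is extra ancestor walking on every match, which is unlikely to matter for small pages.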