Scrapy xpath with following sibling between two h2 tags

Question

I have a poorly designed HTML page that I am trying to extract data from using Scrapy. The following snippet is the part I am interested in:

<html>
    <h2 class="schoolName">Graduate School of Business</h2>
        <ul title="Graduate School of Business departments - part 1"></ul>
        <ul title="Graduate School of Business departments - part 2"></ul>
        <ul title="Graduate School of Business departments - part 3"></ul>
   <h2 class="schoolName">School of Law</h2>
       <ul title="School of Law departments - part 1"></ul>
       <ul title="School of Law departments - part 2"></ul>
  <h2 class="schoolName">School of Medicine</h2>
      <ul title="School of Medicine departments - part 1"></ul>
</html>

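Specifically, I want to know the number of schools and the number of departments under each school. So I get the list of all schools like this: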

>>> schools = response.xpath('//h2[@class="schoolName"]/text()').getall()
>>> schools
['Graduate School of Business', 'School of Law', 'School of Medicine']

Then, for each school, I look up the departments under it like this:

>>> for school in schools:
...     print(school)
...     print(response.xpath(f'//h2[@class="schoolName"][text()[contains(.,"{school}")]]/following-sibling::ul/@title').extract())
...     print ("-----------------------------")
...
Graduate School of Business
['Graduate School of Business departments - part 1', 'Graduate School of Business departments - part 2', 'Graduate School of Business departments - part 3', 'School of Law departments - part 1', 'School of Law departments - part 2', 'School of Medicine departments - part 1']
-----------------------------
School of Law
['School of Law departments - part 1', 'School of Law departments - part 2', 'School of Medicine departments - part 1']
-----------------------------
School of Medicine
['School of Medicine departments - part 1']
-----------------------------

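This clearly does not work as intended, because following-sibling selects every ul tag that comes after the h2, not only the ones between two h2 tags. How can I achieve that?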

Answer 1

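One technique is to pick a common separator element that marks the start of each new block of information (here, the h2), measure its position using count() and preceding-sibling, and then select all the data elements that have that same number of separator elements among their preceding siblings. The position is the number of preceding h2 siblings plus one. For example, "School of Law" has one h2 before it, so its position is 2, and every ul that belongs to it has exactly two h2 elements among its preceding siblings.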

In an IPython shell:

In [1]: from lxml import etree

In [2]: string = '''<html>
   ...:     <h2 class="schoolName">Graduate School of Business</h2>
   ...:         <ul title="Graduate School of Business departments - part 1"></ul>
   ...:         <ul title="Graduate School of Business departments - part 2"></ul>
   ...:         <ul title="Graduate School of Business departments - part 3"></ul>
   ...:    <h2 class="schoolName">School of Law</h2>
   ...:        <ul title="School of Law departments - part 1"></ul>
   ...:        <ul title="School of Law departments - part 2"></ul>
   ...:   <h2 class="schoolName">School of Medicine</h2>
   ...:       <ul title="School of Medicine departments - part 1"></ul>
   ...: </html>'''

In [3]: root = etree.fromstring(string)

In [4]: schools = root.xpath('//h2[@class="schoolName"]/text()')

In [5]: schools
Out[5]: ['Graduate School of Business', 'School of Law', 'School of Medicine']

In [6]: for school in schools:
   ...:     print (school)
   ...:     position = int(root.xpath(f'count(//h2[text()="{school}"]/preceding-sibling::h2) + 1'))
   ...:     print (f"Position: {position}")
   ...:     print (root.xpath(f'//ul[count(preceding-sibling::h2) = {position}]/@title'))
   ...: 
Graduate School of Business
Position: 1
['Graduate School of Business departments - part 1', 'Graduate School of Business departments - part 2', 'Graduate School of Business departments - part 3']
School of Law
Position: 2
['School of Law departments - part 1', 'School of Law departments - part 2']
School of Medicine
Position: 3
['School of Medicine departments - part 1']
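
The loop above builds each XPath by interpolating the school name into the expression, which could break if a name contains a double quote or if two schools share a name. As a rough sketch of the same counting technique, the variant below starts from each h2 node instead of its text and passes the position as an XPath variable. The helper name `departments_by_school` and the variable name `$pos` are made up for this illustration, and the snippet assumes the fragment parses as well-formed XML, as it does in the session above.

```python
from lxml import etree

HTML = '''<html>
    <h2 class="schoolName">Graduate School of Business</h2>
    <ul title="Graduate School of Business departments - part 1"></ul>
    <ul title="Graduate School of Business departments - part 2"></ul>
    <ul title="Graduate School of Business departments - part 3"></ul>
    <h2 class="schoolName">School of Law</h2>
    <ul title="School of Law departments - part 1"></ul>
    <ul title="School of Law departments - part 2"></ul>
    <h2 class="schoolName">School of Medicine</h2>
    <ul title="School of Medicine departments - part 1"></ul>
</html>'''


def departments_by_school(root):
    """Map each school name to the titles of the ul siblings in its block.

    Hypothetical helper for illustration; schools with identical names
    would overwrite each other in the returned dict.
    """
    groups = {}
    for h2 in root.xpath('//h2[@class="schoolName"]'):
        # Position of this separator: preceding separator siblings, plus one.
        position = int(
            h2.xpath('count(preceding-sibling::h2[@class="schoolName"])')
        ) + 1
        # Keep only the following ul siblings that have exactly `position`
        # separators before them, i.e. those before the next h2.
        titles = h2.xpath(
            'following-sibling::ul'
            '[count(preceding-sibling::h2[@class="schoolName"]) = $pos]/@title',
            pos=position,
        )
        groups[h2.text] = [str(title) for title in titles]
    return groups


root = etree.fromstring(HTML)
groups = departments_by_school(root)
for school, titles in groups.items():
    print(f"{school}: {len(titles)} department list(s)")
```

In a Scrapy callback the same expressions should carry over to `response.xpath(...)`, since Scrapy selectors also accept XPath variables as keyword arguments; the result would then be read with `.getall()` rather than as a plain list.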
