Is there a way to look for a specific line of code in web-scraping with bs4?

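I am trying to scrape a page built from HTML lists and unordered lists (nested inside other lists and unordered lists), but because the tags have no identifying attributes, I can't scrape them.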

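Each <ul> tag under a day heading contains that day's data. I know how to scrape nested <ul><li> tags, but I can't target these ones because they lack attributes. I was wondering whether I can take the parsed page and look for the tags under the line containing the day, so that I can scrape one day at a time. Any help would be appreciated.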

The code is a bit long:

<div class="show-content user_content clearfix enhanced" data-uw-styling-context="true">
  <h1 class="page-title" data-uw-styling-context="true">Unit 3 I Week 3</h1>
  
  
    <div style="background-color: #184366; color: white; padding: 15px;" data-uw-styling-context="true">
<h2 data-uw-styling-context="true"><span style="font-size: 30pt;" data-uw-styling-context="true">Unit 3 | Week 3: January 18th-21st</span></h2>
</div>
<h2 data-uw-styling-context="true">Essential Questions</h2>
<ul data-uw-styling-context="true">
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">How does voice relate to the audience and purpose?</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">What techniques does the author use to get his/her point across and communicate?</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">How can technology be beneficial and/or detrimental to society?</span></li>
</ul>
<h2 data-uw-styling-context="true">Objectives</h2>
<ul data-uw-styling-context="true">
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Analyze the concept of utopia/dystopia as presented in the novel</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Create a utopia to represent the ideas of the group and backed up with research</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Analyze expository/informational text&nbsp;</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Understand rhetorical devices and logical fallacies</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Interpret elements of media including television and digital graphics</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Create a TV newscast that organizes and presents research with certain purposes and audiences in mind</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Collaborate to create a professional product</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Explain author’s purpose and message within a text</span></li>
</ul>
<p data-uw-styling-context="true"><img src="https://fisd.instructure.com/courses/56950/files/4791824/download" alt="tear drop line 3.png" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791824" data-api-returntype="File" style="max-width: 676px;" data-uw-styling-context="true"></p>
<h2 data-uw-styling-context="true"> Monday</h2>
<ul data-uw-styling-context="true">
<li style="list-style-type: none;" data-uw-styling-context="true">
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">No School</li>
</ul>
</li>
</ul>
<hr data-uw-styling-context="true">
<h2 data-uw-styling-context="true"> Tuesday</h2>
<ul data-uw-styling-context="true">
<li style="list-style-type: none;" data-uw-styling-context="true">
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">In Class Today:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Read Chapter 4</li>
<li data-uw-styling-context="true">Annotations&nbsp;</li>
<li data-uw-styling-context="true">Book Study</li>
</ul>
</li>
<li data-uw-styling-context="true">Due Today:</li>
<li data-uw-styling-context="true">Homework for Next Class:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Study Stems</li>
<li data-uw-styling-context="true">Annotations and Book Study 1-4 due BOC Wed</li>
</ul>
</li>
</ul>
</li>
</ul>
<hr data-uw-styling-context="true">
<h2 data-uw-styling-context="true"> Wednesday</h2>
<ul data-uw-styling-context="true">
<li style="list-style-type: none;" data-uw-styling-context="true">
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">In Class Today:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Subject Complement Notes&nbsp;</li>
<li data-uw-styling-context="true">"There Will Come Soft Rains"&nbsp;</li>
</ul>
</li>
<li data-uw-styling-context="true">Due Today:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Annotations and Book Study Ch. 1-4</li>
</ul>
</li>
<li data-uw-styling-context="true">Homework for Next Class:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Study Stems&nbsp;</li>
</ul>
</li>
</ul>
</li>
</ul>
<hr data-uw-styling-context="true">
<h2 data-uw-styling-context="true"> Thursday</h2>
<ul data-uw-styling-context="true">
<li style="list-style-type: none;" data-uw-styling-context="true">
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">In Class Today:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Subject Complement Practice</li>
<li data-uw-styling-context="true">TWCSR</li>
</ul>
</li>
<li data-uw-styling-context="true">Due Today:</li>
<li data-uw-styling-context="true">Homework for Next Class:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Study Stems&nbsp;</li>
</ul>
</li>
</ul>
</li>
</ul>
<hr data-uw-styling-context="true">
<h2 data-uw-styling-context="true"> Friday</h2>
<ul data-uw-styling-context="true">
<li style="list-style-type: none;" data-uw-styling-context="true">
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">In Class Today:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Stems Quiz 5 Major Grade</li>
<li data-uw-styling-context="true">TWCSR (Due Monday BOC)</li>
</ul>
</li>
<li data-uw-styling-context="true">Due Today:</li>
<li data-uw-styling-context="true">Homework for Next Class:</li>
</ul>
</li>
</ul>
<p data-uw-styling-context="true"><img src="https://fisd.instructure.com/courses/56950/files/4791824/download" alt="tear drop line 3.png" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791824" data-api-returntype="File" style="max-width: 676px;" data-uw-styling-context="true"></p>
<p data-uw-styling-context="true"><img style="float: left; max-width: 72px;" src="https://fisd.instructure.com/courses/56950/files/4791827/download" alt="Left Arrow (1).png" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791827" data-api-returntype="File" data-uw-styling-context="true"></p>
<p data-uw-styling-context="true"><br data-uw-styling-context="true">&nbsp;<a title="Unit 3 Overview" href="https://fisd.instructure.com/courses/111538/pages/unit-3-overview" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/111538/pages/unit-3-overview" data-api-returntype="Page" data-uw-styling-context="true">Unit 3 Homepage</a></p>
<p data-uw-styling-context="true">&nbsp;</p>
<p data-uw-styling-context="true"><a title="Home" href="https://fisd.instructure.com/courses/111538/pages/home" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/111538/pages/home" data-api-returntype="Page" data-uw-styling-context="true"><img style="float: left; max-width: 72px;" src="https://fisd.instructure.com/courses/56950/files/4791834/download?wrap=1" alt="Home Black.png" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791834" data-api-returntype="File" data-uw-styling-context="true"> <br data-uw-styling-context="true">Course Homepage</a></p>
<p data-uw-styling-context="true">&nbsp;</p>
  
</div>

Here is a screenshot of the page.

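Note: Because the question lacks detail, this answer can only point toward how to scrape the information in context. It does not cover how to reach the page on the website or how to prepare the data exactly.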

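The approach is to select every <h2> whose text contains "day", take the next <li> after it, and from that <li> select the descendant <li> elements that themselves contain nested <li> items. Sections without nested items (for example "No School" or an empty "Due Today:") are therefore skipped: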

for day in soup.select('h2:-soup-contains("day")'):
    for item in day.find_next('li').select('li:has(li)'):
        print(item.text)
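To scrape one day at a time, as the question asks, the same idea can be narrowed to a single heading. The following is only a minimal sketch under assumptions: `sample_html` is a shortened stand-in for the page with the same nesting, and `items_for_day` is a hypothetical helper name, not part of the original answer. It uses `html.parser` so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question (assumption: same nesting).
sample_html = """
<h2>Objectives</h2>
<ul><li>Analyze expository text</li></ul>
<h2> Monday</h2>
<ul><li style="list-style-type: none;"><ul>
  <li>No School</li>
</ul></li></ul>
<hr>
<h2> Tuesday</h2>
<ul><li style="list-style-type: none;"><ul>
  <li>In Class Today:
    <ul><li>Read Chapter 4</li><li>Book Study</li></ul>
  </li>
  <li>Due Today:</li>
</ul></li></ul>
"""


def items_for_day(soup, day_name):
    """Return the stripped text of the innermost <li> items under one day heading."""
    # Find the <h2> whose text contains the requested day name.
    heading = soup.find(lambda tag: tag.name == "h2" and day_name in tag.get_text())
    if heading is None:
        return []
    # The day's data sits in the first <ul> sibling that follows the heading.
    day_list = heading.find_next_sibling("ul")
    if day_list is None:
        return []
    # Leaf items only: <li> elements that contain no further <li>.
    return [li.get_text(strip=True) for li in day_list.select("li:not(:has(li))")]


soup = BeautifulSoup(sample_html, "html.parser")
print(items_for_day(soup, "Tuesday"))
```

Unlike `li:has(li)`, the `li:not(:has(li))` selector keeps the leaf entries, so a day such as Monday that only says "No School" still yields a result.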

Example

html = '''<div class="show-content user_content clearfix enhanced" data-uw-styling-context="true"> <h1 class="page-title" data-uw-styling-context="true">Unit 3 I Week 3</h1>   <div style="background-color: #184366; color: white; padding: 15px;" data-uw-styling-context="true"> <h2 data-uw-styling-context="true"><span style="font-size: 30pt;" data-uw-styling-context="true">Unit 3 | Week 3: January 18th-21st</span></h2> </div> <h2 data-uw-styling-context="true">Essential Questions</h2> <ul data-uw-styling-context="true"> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">How does voice relate to the audience and purpose?</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">What techniques does the author use to get his/her point across and communicate?</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">How can technology be beneficial and/or detrimental to society?</span></li> </ul> <h2 data-uw-styling-context="true">Objectives</h2> <ul data-uw-styling-context="true"> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Analyze the concept of utopia/dystopia as presented in the novel</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Create a utopia to represent the ideas of the group and backed up with research</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Analyze expository/informational text&nbsp;</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Understand rhetorical devices and logical fallacies</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Interpret elements of media including television and digital graphics</span></li> <li aria-level="1" data-uw-styling-context="true"><span 
data-uw-styling-context="true">Create a TV newscast that organizes and presents research with certain purposes and audiences in mind</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Collaborate to create a professional product</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Explain author’s purpose and message within a text</span></li> </ul> <p data-uw-styling-context="true"><img src="https://fisd.instructure.com/courses/56950/files/4791824/download" alt="tear drop line 3.png" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791824" data-api-returntype="File" style="max-width: 676px;" data-uw-styling-context="true"></p> <h2 data-uw-styling-context="true"> Monday</h2> <ul data-uw-styling-context="true"> <li style="list-style-type: none;" data-uw-styling-context="true"> <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">No School</li> </ul> </li> </ul> <hr data-uw-styling-context="true"> <h2 data-uw-styling-context="true"> Tuesday</h2> <ul data-uw-styling-context="true"> <li style="list-style-type: none;" data-uw-styling-context="true"> <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">In Class Today: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Read Chapter 4</li> <li data-uw-styling-context="true">Annotations&nbsp;</li> <li data-uw-styling-context="true">Book Study</li> </ul> </li> <li data-uw-styling-context="true">Due Today:</li> <li data-uw-styling-context="true">Homework for Next Class: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Study Stems</li> <li data-uw-styling-context="true">Annotations and Book Study 1-4 due BOC Wed</li> </ul> </li> </ul> </li> </ul> <hr data-uw-styling-context="true"> <h2 data-uw-styling-context="true"> Wednesday</h2> <ul data-uw-styling-context="true"> <li style="list-style-type: none;" data-uw-styling-context="true"> 
<ul data-uw-styling-context="true"> <li data-uw-styling-context="true">In Class Today: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Subject Complement Notes&nbsp;</li> <li data-uw-styling-context="true">"There Will Come Soft Rains"&nbsp;</li> </ul> </li> <li data-uw-styling-context="true">Due Today: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Annotations and Book Study Ch. 1-4</li> </ul> </li> <li data-uw-styling-context="true">Homework for Next Class: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Study Stems&nbsp;</li> </ul> </li> </ul> </li> </ul> <hr data-uw-styling-context="true"> <h2 data-uw-styling-context="true"> Thursday</h2> <ul data-uw-styling-context="true"> <li style="list-style-type: none;" data-uw-styling-context="true"> <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">In Class Today: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Subject Complement Practice</li> <li data-uw-styling-context="true">TWCSR</li> </ul> </li> <li data-uw-styling-context="true">Due Today:</li> <li data-uw-styling-context="true">Homework for Next Class: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Study Stems&nbsp;</li> </ul> </li> </ul> </li> </ul> <hr data-uw-styling-context="true"> <h2 data-uw-styling-context="true"> Friday</h2> <ul data-uw-styling-context="true"> <li style="list-style-type: none;" data-uw-styling-context="true"> <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">In Class Today: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Stems Quiz 5 Major Grade</li> <li data-uw-styling-context="true">TWCSR (Due Monday BOC)</li> </ul> </li> <li data-uw-styling-context="true">Due Today:</li> <li data-uw-styling-context="true">Homework for Next Class:</li> </ul> </li> </ul> <p data-uw-styling-context="true"><img 
src="https://fisd.instructure.com/courses/56950/files/4791824/download" alt="tear drop line 3.png" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791824" data-api-returntype="File" style="max-width: 676px;" data-uw-styling-context="true"></p> <p data-uw-styling-context="true"><img style="float: left; max-width: 72px;" src="https://fisd.instructure.com/courses/56950/files/4791827/download" alt="Left Arrow (1).png" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791827" data-api-returntype="File" data-uw-styling-context="true"></p> <p data-uw-styling-context="true"><br data-uw-styling-context="true">&nbsp;<a title="Unit 3 Overview" href="https://fisd.instructure.com/courses/111538/pages/unit-3-overview" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/111538/pages/unit-3-overview" data-api-returntype="Page" data-uw-styling-context="true">Unit 3 Homepage</a></p> <p data-uw-styling-context="true">&nbsp;</p> <p data-uw-styling-context="true"><a title="Home" href="https://fisd.instructure.com/courses/111538/pages/home" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/111538/pages/home" data-api-returntype="Page" data-uw-styling-context="true"><img style="float: left; max-width: 72px;" src="https://fisd.instructure.com/courses/56950/files/4791834/download?wrap=1" alt="Home Black.png" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791834" data-api-returntype="File" data-uw-styling-context="true"> <br data-uw-styling-context="true">Course Homepage</a></p> <p data-uw-styling-context="true">&nbsp;</p>  </div> '''

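from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires the lxml package
data = []
# One entry per <h2> whose text contains "day"
for day in soup.select('h2:-soup-contains("day")'):
    d = {'day': day.text, 'items': []}
    # Keep only the <li> sections that contain nested <li> items
    for item in day.find_next('li').select('li:has(li)'):
        d['items'].append({'item': item.text})
    data.append(d)
data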

Output

[{'day': ' Monday', 'items': []},
 {'day': ' Tuesday',
  'items': [{'item': 'In Class Today:  Read Chapter 4 Annotations\xa0 Book Study  '},
   {'item': 'Homework for Next Class:  Study Stems Annotations and Book Study 1-4 due BOC Wed  '}]},
 {'day': ' Wednesday',
  'items': [{'item': 'In Class Today:  Subject Complement Notes\xa0 "There Will Come Soft Rains"\xa0  '},
   {'item': 'Due Today:  Annotations and Book Study Ch. 1-4  '},
   {'item': 'Homework for Next Class:  Study Stems\xa0  '}]},
 {'day': ' Thursday',
  'items': [{'item': 'In Class Today:  Subject Complement Practice TWCSR  '},
   {'item': 'Homework for Next Class:  Study Stems\xa0  '}]},
 {'day': ' Friday',
  'items': [{'item': 'In Class Today:  Stems Quiz 5 Major Grade TWCSR (Due Monday BOC)  '}]}]
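The `item.text` values in this output still carry runs of whitespace and non-breaking spaces (`\xa0`), and each section's label is fused with its sub-items. One possible refinement is sketched below, assuming a shortened stand-in for the page (`sample_html`) and two hypothetical helpers, `clean` and `parse_day`, that are not part of the original answer. It splits each section into a label and a list of cleaned sub-items, and it also keeps sections that have no sub-items.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one day of the page (assumption: same nesting).
sample_html = """
<h2> Tuesday</h2>
<ul><li style="list-style-type: none;"><ul>
  <li>In Class Today:
    <ul><li>Read Chapter 4</li><li>Annotations&nbsp;</li></ul>
  </li>
  <li>Due Today:</li>
</ul></li></ul>
"""


def clean(text):
    # str.split() with no arguments splits on any whitespace, including \xa0.
    return " ".join(text.split())


def parse_day(heading):
    """Map each section label under a day heading to its list of sub-items."""
    sections = {}
    wrapper = heading.find_next("li")  # the unstyled wrapper <li>
    inner = wrapper.find("ul")         # the <ul> that holds the sections
    for section in inner.find_all("li", recursive=False):
        # The label is the section's own leading text, before any nested <ul>.
        label = clean(section.find(string=True, recursive=False) or "")
        sections[label] = [clean(li.get_text()) for li in section.select("li")]
    return sections


soup = BeautifulSoup(sample_html, "html.parser")
# :-soup-contains needs a reasonably recent soupsieve (2.1 or later).
result = {
    clean(h2.get_text()): parse_day(h2)
    for h2 in soup.select('h2:-soup-contains("day")')
}
print(result)
```

Keeping empty sections as empty lists makes it possible to tell "nothing due" apart from a section that is missing from the page altogether.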