How do I scrape data from these HTML tags using Python and bs4?
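I can scrape the data successfully, but how do I write it to a CSV (comma-separated) file?
I want to extract the elements with the classes "ui-h2", "help__content help__content--small" and "exam-name". I have the code below, but the output I want is:
- April 2022,3 Apr 2022,UPPSC ACF RFO Mains
- April 2022,3 Apr 2022,MPSC Group C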
<div class="item-name" data-toggle="collapse" data-target="#exam-4" aria-expanded=false>
<div class="ui-h2">April 2022 <span class="ui-tag grey-transparent">14 Exams</span></div>
</div>
<div class="item-details collapse " id="exam-4" data-parent="#exam-month">
<div class="row">
<div class="col-12 col-lg-4">
<div class="ui-card hover-scale">
<a href="https://testbook.com/uppsc-acf-rfo" class="card-link exam-cards">
<div>
<span class="icon calendar-icon"></span>
<span class="help__content help__content--small">3 Apr 2022</span>
<span class="ui-tag green-filled">Official</span>
</div>
<div class="footer-container">
<span class="exam-icon">
<img src="https://blogmedia.testbook.com/blog/wp-content/uploads/2020/06/uttar-pradesh-logo-png-8-5bbbec3b.png" height="30">
</span>
<span class="exam-name" title="UPPSC ACF RFO Mains">UPPSC ACF RFO Mains</span>
<span class="exam-cta">
Know More <span class="right-icon"></span>
</span>
</div>
</a>
</div>
</div>
<div class="col-12 col-lg-4">
<div class="ui-card hover-scale">
<a href="https://testbook.com/mpsc-group-c" class="card-link exam-cards">
<div>
<span class="icon calendar-icon"></span>
<span class="help__content help__content--small">3 Apr 2022</span>
<span class="ui-tag green-filled">Official</span>
</div>
<div class="footer-container">
<span class="exam-icon">
<img src="https://blogmedia.testbook.com/blog/wp-content/uploads/2020/03/mpsc-logo-1-44a80da2.png" height="30">
</span>
<span class="exam-name" title="MPSC Group C">MPSC Group C</span>
<span class="exam-cta">
Know More <span class="right-icon"></span>
</span>
</div>
</a>
</div>
</div>
</div>
</div>
for contents in soup.find_all("div", {"class":"ui-h2"}):
    #print(contents)
    if contents.text is not None:
        #print(contents.text)
        f.write(contents.text+"-")
for contentspan2 in soup.find_all("span", {"class":"help__content help__content--small"}):
    if contentspan2.string is not None:
        #print(contentspan2.string)
        f.write(contentspan2.string+",")
for contentspan in soup.find_all("span", {"class":"exam-name"}):
    if contentspan.string is not None:
        #print(contentspan.string)
        f.write(contentspan.string+"\n")
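Build a list of the rows you want to write to the file, where each element of the list is a dictionary whose key:value pairs are column name:value. Then let pandas do the work.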
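The row-list shape itself does not depend on pandas. As a small illustration, here is the same idea using only the standard library's csv.DictWriter, which is a swapped-in alternative and not what this answer uses. The two rows are hard-coded stand-ins for scraped values, and the in-memory buffer is only there so the sketch runs without touching the disk.

```python
import csv
import io

# Hypothetical rows, hard-coded here in place of scraped values.
row_list = [
    {"date": "3 Apr 2022", "exam": "UPPSC ACF RFO Mains"},
    {"date": "3 Apr 2022", "exam": "MPSC Group C"},
]

# Each dictionary key becomes a column; each dictionary becomes one CSV line.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["date", "exam"])
writer.writeheader()
writer.writerows(row_list)

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, replace the buffer with open("filename.csv", "w", newline="").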
Given the HTML you provided:
import re

import pandas as pd
from bs4 import BeautifulSoup

# html holds the markup shown in the question
soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('div', {'class':'row'})

rowList = []
for row in rows:
    # Each exam sits in a div whose class starts with "ui-card"
    cards = row.find_all('div', {'class':re.compile("^ui-card")})
    for card in cards:
        dateStr = card.find('span',{'class':re.compile("^help__content")}).text.strip()
        examName = card.find('span', {'class':'exam-name'}).text
        rowList.append({'date':dateStr,
                        'exam':examName})

df = pd.DataFrame(rowList)
df.to_csv('filename.csv', index=False)
Output:
print(df)
         date                 exam
0  3 Apr 2022  UPPSC ACF RFO Mains
1  3 Apr 2022         MPSC Group C
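The output above has the date and exam name but not the month heading ("April 2022") that the question also asks for. One possible way to add it is sketched below. It rests on an assumption read off the posted markup: that each "item-name" header's data-target attribute (here "#exam-4") names the id of the "item-details" block holding that month's cards. The html string is a trimmed copy of the question's markup, extract_rows is a hypothetical helper name, and the standard library csv module is used in place of pandas to keep the sketch small.

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (one month, two exam cards).
html = """
<div class="item-name" data-toggle="collapse" data-target="#exam-4">
  <div class="ui-h2">April 2022 <span class="ui-tag grey-transparent">14 Exams</span></div>
</div>
<div class="item-details collapse " id="exam-4" data-parent="#exam-month">
  <div class="ui-card hover-scale">
    <span class="help__content help__content--small">3 Apr 2022</span>
    <span class="exam-name" title="UPPSC ACF RFO Mains">UPPSC ACF RFO Mains</span>
  </div>
  <div class="ui-card hover-scale">
    <span class="help__content help__content--small">3 Apr 2022</span>
    <span class="exam-name" title="MPSC Group C">MPSC Group C</span>
  </div>
</div>
"""


def extract_rows(html_text):
    """Return one [month, date, exam] list per exam card."""
    soup = BeautifulSoup(html_text, "html.parser")
    rows = []
    for header in soup.find_all("div", class_="item-name"):
        heading = header.find("div", class_="ui-h2")
        # The first text node is the month; the nested span holds the exam count.
        month = next(heading.stripped_strings)
        # Assumption: data-target="#exam-4" points at the details block's id.
        details = soup.find(id=header.get("data-target", "").lstrip("#"))
        if details is None:
            continue
        for card in details.find_all("div", class_="ui-card"):
            date = card.find("span", class_="help__content").get_text(strip=True)
            exam = card.find("span", class_="exam-name").get_text(strip=True)
            rows.append([month, date, exam])
    return rows


rows = extract_rows(html)

# In-memory buffer for the demo; open("exams.csv", "w", newline="") works the same way.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Reading the month from the heading's first text node, rather than from .text, keeps the "14 Exams" tag out of the month column. If the real page links headers to their cards some other way, the lookup on data-target is the part to change.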