Select class name in HTML Parser containing extra words
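I am trying to scrape a web page and want to get the reviews. The reviews fall into three categories: some are positive, some neutral, and some negative. I am using an HTML parser and have already accessed many tags. But for a class that can take one of three forms, how do I get them all: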
<div class="review positive" title="" style="background-color: #00B551;">9.3</div>
<div class="review negative" title="" style="background-color: #FF0000;">4.8</div>
<div class="review neutral" title="" style="background-color: #FFFF00;">6</div>
I have a Python list of containers, where each div holds one item:
# finds each product from the store page
containers = page_soup.findAll("div", {"class": "item-container"})
for container in containers:
    title = container.find("a").text  # this gives me the title
    # Similarly, I need the review of each of them here.
    # Along with "review", the class also carries the word positive, neutral
    # or negative according to the type of review, so this finds nothing:
    review = container.findAll("div", {"class": "review "})
With a regular expression, you can match every class that contains the substring "review":
import re
for container in containers:
    title = container.find("a").text  # this gives me the title
    review = container.findAll("div", {"class": re.compile(r'review')})
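To make the loop concrete, here is a minimal self-contained sketch that pulls the title, the score, and the review category out of each container. The markup around the review divs (the `item-container` wrappers, the links, and the product names) is an assumption made up for illustration, as are the names `CATEGORIES` and `results`; the real page may be structured differently.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page: only the review divs come from the question,
# the surrounding containers and links are assumed for this sketch.
html = """
<div class="item-container">
  <a href="/p/1">Product A</a>
  <div class="review positive" title="" style="background-color: #00B551;">9.3</div>
</div>
<div class="item-container">
  <a href="/p/2">Product B</a>
  <div class="review negative" title="" style="background-color: #FF0000;">4.8</div>
</div>
<div class="item-container">
  <a href="/p/3">Product C</a>
</div>
"""

page_soup = BeautifulSoup(html, "html.parser")
CATEGORIES = {"positive", "neutral", "negative"}

results = []
for container in page_soup.find_all("div", {"class": "item-container"}):
    link = container.find("a")
    title = link.get_text(strip=True) if link else None

    # find() returns the first match or None, so a missing review is easy to detect
    review_div = container.find("div", {"class": re.compile(r"review")})
    if review_div is None:
        results.append((title, None, None))
        continue

    # tag["class"] is a list of class tokens, e.g. ["review", "positive"]
    category = next((c for c in review_div["class"] if c in CATEGORIES), None)
    score = float(review_div.get_text(strip=True))
    results.append((title, score, category))

print(results)
```

Using `find` instead of `findAll` here assumes each container holds at most one review div; if a container can hold several, iterate over `findAll` instead.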
See the difference:
html = '''<div class="review positive" title="" style="background-color: #00B551;">9.3</div>
<div class="review negative" title="" style="background-color: #FF0000;">4.8</div>
<div class="review neutral" title="" style="background-color: #FFFF00;">6</div>'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, 'html.parser')
review = soup.find_all('div', {'class':'review '})
print ('No regex: ',review)
print('\n')
review = soup.findAll("div", {"class": re.compile(r'review')})
print ('Regex: ',review)
Output:
No regex: []
Regex: [<div class="review positive" style="background-color: #00B551;" title="">9.3</div>, <div class="review negative" style="background-color: #FF0000;" title="">4.8</div>, <div class="review neutral" style="background-color: #FFFF00;" title="">6</div>]
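One caveat with the regex approach: BeautifulSoup applies the pattern with `search()`, so `re.compile(r'review')` also matches class names such as "preview". As a possible alternative, not part of the answer above, the sketch below compares the regex with the `class_` keyword and a CSS selector, both of which match individual class tokens. The fourth div with class "preview" is a made-up element added only to show the difference.

```python
import re
from bs4 import BeautifulSoup

html = '''<div class="review positive">9.3</div>
<div class="review negative">4.8</div>
<div class="review neutral">6</div>
<div class="preview">not a review</div>'''  # hypothetical extra element

soup = BeautifulSoup(html, "html.parser")

# Substring match: also picks up the "preview" div
by_regex = soup.find_all("div", {"class": re.compile(r"review")})

# Token match: a tag matches if any one of its classes equals "review"
by_class = soup.find_all("div", class_="review")

# CSS selectors: same token semantics, and classes can be combined
by_css = soup.select("div.review")
positive = soup.select("div.review.positive")

print(len(by_regex), len(by_class), len(by_css), len(positive))
```

If the regex form is preferred, anchoring the pattern, for example `re.compile(r'^review$')`, restricts it to the exact class token.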