Get the star rating from an HTML page using BeautifulSoup
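I am trying to get the star ratings from this page (https://www.edmunds.com/tesla/model-3/2019/consumer-reviews/).
I mean the sections such as Safety, Performance, Comfort, and so on.
Below is the HTML code: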
<div class="justify-content-between flex-column flex-md-row row"><dl class="mb-1 d-flex justify-content-between pr-1_5 pr-sm-0 pr-md-1_5 pr-lg-0 pr-xl-2_5 col-7 col-sm-4 col-md-5"><dt class="font-weight-normal">Safety</dt><dd class="mb-0"><span class="rating-stars text-primary-darker"><span class="sr-only">5 out of 5 stars</span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span></span></dd></dl><dl class="mb-1 d-flex justify-content-between pr-1_5 pr-sm-0 pr-md-1_5 pr-lg-0 pr-xl-2_5 col-7 col-sm-4 col-md-5"><dt class="font-weight-normal">Technology</dt><dd class="mb-0"><span class="rating-stars text-primary-darker"><span class="sr-only">5 out of 5 stars</span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span></span></dd></dl><dl class="mb-1 d-flex justify-content-between pr-1_5 pr-sm-0 pr-md-1_5 pr-lg-0 pr-xl-2_5 col-7 col-sm-4 col-md-5"><dt class="font-weight-normal">Performance</dt><dd class="mb-0"><span class="rating-stars text-primary-darker"><span class="sr-only">5 out of 5 stars</span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span></span></dd></dl><dl class="mb-1 d-flex justify-content-between pr-1_5 pr-sm-0 pr-md-1_5 pr-lg-0 pr-xl-2_5 col-7 col-sm-4 col-md-5"><dt class="font-weight-normal">Interior</dt><dd class="mb-0"><span class="rating-stars text-primary-darker"><span class="sr-only">5 out of 5 stars</span><span class="rating-star icon-star-full"></span><span class="rating-star 
icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span></span></dd></dl><dl class="mb-1 d-flex justify-content-between pr-1_5 pr-sm-0 pr-md-1_5 pr-lg-0 pr-xl-2_5 col-7 col-sm-4 col-md-5"><dt class="font-weight-normal">Comfort</dt><dd class="mb-0"><span class="rating-stars text-primary-darker"><span class="sr-only">5 out of 5 stars</span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span></span></dd></dl><dl class="mb-1 d-flex justify-content-between pr-1_5 pr-sm-0 pr-md-1_5 pr-lg-0 pr-xl-2_5 col-7 col-sm-4 col-md-5"><dt class="font-weight-normal">Reliability</dt><dd class="mb-0"><span class="rating-stars text-primary-darker"><span class="sr-only">5 out of 5 stars</span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span></span></dd></dl><dl class="mb-1 d-flex justify-content-between pr-1_5 pr-sm-0 pr-md-1_5 pr-lg-0 pr-xl-2_5 col-7 col-sm-4 col-md-5"><dt class="font-weight-normal">Value</dt><dd class="mb-0"><span class="rating-stars text-primary-darker"><span class="sr-only">5 out of 5 stars</span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span><span class="rating-star icon-star-full"></span></span></dd></dl></div></div></div>
If the code is too long, I will post a screenshot.
This is the code I am using, but it does not work for the tags above:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent  # assumed source of UserAgent

data = []
ua = UserAgent()
header = {'User-Agent': str(ua.safari)}
url = 'https://www.edmunds.com/tesla/model-3/2019/consumer-reviews/'
response = requests.get(url, headers=header)
html_soup = BeautifulSoup(response.text, 'lxml')

# Each review sits in its own div.review-item
content_list = html_soup.find_all('div', attrs={'class': 'review-item'})
for e in content_list:
    d = {'review_title': e.a.text,
         'review_content': e.select_one('p').text,
         'overall_rating': e.select_one('span.sr-only').text,
         'reviewer_name': e.div.text.split(',')[0].strip(),
         'review_date': e.div.text.split(',')[1].strip(),
         }
    data.append(d)

df = pd.DataFrame(data)
# Drop reviews that appear more than once
df1 = df.drop_duplicates(subset=['reviewer_name', 'review_title'], keep='first')
Basically, what I want is a column for each star rating, e.g. Safety: 5.0, Performance: 5.0, Comfort: 5.0, and so on.
I am trying to use this piece of code:
d.update(dict(s.stripped_strings for s in e.select('span.rating-stars span.sr-only')))
data.append(d)
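However, it does not work. Also, the tags holding the overall star rating and the detailed star ratings have the same class; the difference is that they sit under different parent tags (I hope I have not made this too complicated). Anyway, I hope someone can help me.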
EDIT
I edited the code slightly, because the code I pasted did not seem to work, which is odd.
In general, stripped_strings works well as long as the right elements are selected:
d.update(dict(s.stripped_strings for s in e.select('dl')))
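As a rough illustration of why the selector matters, the sketch below compares the two selections on a trimmed-down sample. The HTML is a hypothetical two-category excerpt modelled on the markup in the question, and 'html.parser' is used only to keep the example dependency-free; neither is taken from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample modelled on the markup in the question.
html = """
<div class="review-item">
  <dl><dt>Safety</dt><dd><span class="rating-stars"><span class="sr-only">5 out of 5 stars</span></span></dd></dl>
  <dl><dt>Value</dt><dd><span class="rating-stars"><span class="sr-only">4 out of 5 stars</span></span></dd></dl>
</div>
"""
e = BeautifulSoup(html, "html.parser").select_one("div.review-item")

# Each <dl> yields two strings (label, rating text), so dict() can pair them up.
pairs = dict(s.stripped_strings for s in e.select("dl"))
print(pairs)

# Each span.sr-only yields a single string, so dict() has no key/value pair to build.
try:
    dict(s.stripped_strings for s in e.select("span.rating-stars span.sr-only"))
    failed = False
except ValueError as exc:
    failed = True
    print("ValueError:", exc)
```

The values are still the full text ("5 out of 5 stars") at this point, so they need a further conversion step before they can be used as numbers.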
Given your expected output, I would recommend selecting the strings for the key and the value separately:
...
d.update({s.dt.text:float(s.dd.text.split()[0]) for s in e.select('dl')})
data.append(d)
...
This will update your dict with:
{'Safety': 5.0, 'Technology': 5.0, 'Performance': 5.0, 'Interior': 5.0, 'Comfort': 5.0, 'Reliability': 5.0, 'Value': 5.0}
Or, if the ResultSet is empty, with an empty dict.
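Putting the pieces together, here is a minimal sketch of the approach on a small offline sample, assuming each review's detailed ratings sit in dl elements inside its div.review-item. The helper name detailed_ratings and the sample HTML are made up for illustration, and the second review deliberately has no dl elements to show the empty case.

```python
from bs4 import BeautifulSoup


def detailed_ratings(review):
    """Return {category: stars} for every <dl> inside a review element."""
    ratings = {}
    for dl in review.select("dl"):
        # Skip malformed entries that lack a label or a rating.
        if dl.dt is None or dl.dd is None:
            continue
        # "5 out of 5 stars" -> 5.0
        stars = float(dl.dd.get_text(strip=True).split()[0])
        ratings[dl.dt.get_text(strip=True)] = stars
    return ratings


# Hypothetical sample: one review with detailed ratings, one without.
html = """
<div class="review-item"><a>Great car</a>
  <dl><dt>Safety</dt><dd><span class="sr-only">5 out of 5 stars</span></dd></dl>
  <dl><dt>Comfort</dt><dd><span class="sr-only">4 out of 5 stars</span></dd></dl>
</div>
<div class="review-item"><a>No details given</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

data = []
for e in soup.select("div.review-item"):
    d = {"review_title": e.a.get_text(strip=True)}
    d.update(detailed_ratings(e))  # adds nothing when there are no <dl> tags
    data.append(d)
print(data)
```

Passing data to pd.DataFrame then gives one column per category, with NaN in the rows where a review has no detailed ratings.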