Python crawl - count elements and get text
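I'm trying to scrape a website. The URL is https://www.edmunds.com/kia/telluride/2021/consumer-reviews/?pagesize=50
My first question is about the ratings, which are displayed as stars. How can I get the number of stars each reviewer gave? I need the result as an integer.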
<span class="rating-stars text-primary-darker mr-0_25" aria-label="5 out of 5 stars">
<span class="rating-star icon-star-full"></span>
<span class="rating-star icon-star-full"></span>
<span class="rating-star icon-star-full"></span>
<span class="rating-star icon-star-full"></span>
<span class="rating-star icon-star-full"></span>
</span>
My second question is how to split out and get the date and the username.
I tried
source.find(class_ = 'small text-gray mb-2') #type: bs4.element.Tag
which outputs
<div class="small text-gray mb-2"><div>Vic<!-- -->, <!-- -->10/17/2020</div><div>EX 4dr SUV (3.8L 6cyl 8A)</div></div>
Vic is the username and 10/17/2020 is the date.
Here is my code.
from bs4 import BeautifulSoup
from nltk import tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from selenium import webdriver

chrome_driver = '/Users/chromedriver'
driver = webdriver.Chrome(chrome_driver)
url = 'https://www.edmunds.com/kia/telluride/2021/consumer-reviews/?pagesize=50'
driver.get(url)
src = driver.page_source
# page_source is already a str, so from_encoding is not needed
source = BeautifulSoup(src, 'html.parser')
review_list = source.find_all('div', class_="review-item text-gray-darker")
sid = SentimentIntensityAnalyzer()
sum_review = ''
driver.close()

for review in review_list:
    list1 = []
    # star rating: the star spans contain no text, so this comes back empty
    score = review.find('span').get_text()
    title = review.find('a').get_text().replace('\n', '')
    # username and date come back joined in one string
    writer = review.find('div', {'class': 'small text-gray mb-2'}).get_text()
    date = review.find('span', {'class': 'review-date'}).get_text()
    content = review.find('div', {'class': 'truncated-text size-16'}).get_text()
    list1.append(score)
    list1.append(title)
    list1.append(writer)
    list1.append(date)
    list1.append(content)
    sum_review = sum_review + content
    lines_list = tokenize.sent_tokenize(content)
Thank you very much for your answers!
import json
import re

import pandas as pd
import requests


def main(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0'
    }
    r = requests.get(url, headers=headers)
    # The page embeds its data as JSON assigned to __PRELOADED_STATE__,
    # so the reviews can be read without rendering the page in a browser.
    match = json.loads(
        re.search(r'__PRELOADED_STATE__ = ({.+})', r.text).group(1))
    allin = []
    for item in match['consumerReviews']['consumerReviews']['reviews']:
        # One row per review: author, date, overall rating, title, text
        goal = [
            item['author']['authorName'],
            item['created'],
            item['vehicleRating']['overall'],
            item['title'],
            item['text']
        ]
        allin.append(goal)
    df = pd.DataFrame(
        allin, columns=['Author', 'Date', 'Rate', 'Title', 'Content'])
    df.to_csv('Data.csv', index=False)
    print(df)


main('https://www.edmunds.com/kia/telluride/2021/consumer-reviews/?pagesize=50')
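If you would rather stay with the BeautifulSoup markup from the question, both values can probably be read straight from the HTML as well. The following is only a minimal sketch that runs against the two snippets quoted in the question. The helper names `parse_rating` and `parse_writer` are made up for illustration, and it assumes that the `aria-label` always starts with the star count and that the first inner `div` always has the form "name, MM/DD/YYYY".

```python
import re
from datetime import datetime

from bs4 import BeautifulSoup

# HTML fragments copied from the question.
RATING_HTML = '''
<span class="rating-stars text-primary-darker mr-0_25" aria-label="5 out of 5 stars">
<span class="rating-star icon-star-full"></span>
<span class="rating-star icon-star-full"></span>
<span class="rating-star icon-star-full"></span>
<span class="rating-star icon-star-full"></span>
<span class="rating-star icon-star-full"></span>
</span>
'''
WRITER_HTML = (
    '<div class="small text-gray mb-2"><div>Vic<!-- -->, <!-- -->10/17/2020</div>'
    '<div>EX 4dr SUV (3.8L 6cyl 8A)</div></div>'
)


def parse_rating(tag):
    """Return the star rating as an int (hypothetical helper)."""
    # Preferred source: the leading number in aria-label, e.g. "5 out of 5 stars".
    label = tag.get('aria-label', '')
    match = re.match(r'\s*(\d+)', label)
    if match:
        return int(match.group(1))
    # Fallback: count the child spans that carry the full-star icon class.
    return len(tag.select('span.icon-star-full'))


def parse_writer(tag):
    """Return (username, date) from the writer block (hypothetical helper)."""
    # Only the first inner div holds "name, date"; the second is the trim level.
    text = tag.find('div').get_text()
    # Split on the last ", " so a comma inside the username does not break it.
    name, _, date_text = text.rpartition(', ')
    review_date = datetime.strptime(date_text.strip(), '%m/%d/%Y').date()
    return name.strip(), review_date


rating_tag = BeautifulSoup(RATING_HTML, 'html.parser').find('span', class_='rating-stars')
writer_tag = BeautifulSoup(WRITER_HTML, 'html.parser').find('div', class_='small text-gray mb-2')

rating = parse_rating(rating_tag)
name, review_date = parse_writer(writer_tag)

print(rating)             # 5
print(name, review_date)  # Vic 2020-10-17
```

Inside the loop from the question this would become something like `score = parse_rating(review.find('span', class_='rating-stars'))`, provided every review item actually contains such a span; that has not been checked against the live page.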