使用 python 抓取预订评论
Scraping Booking coments with python
我正在尝试从该网站获取 Booking.com 条评论的标题:
https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75,
其中 r_lang=all
基本上是说网站应该显示所有语言的评论。
为了从此页面获取标题,我这样做了:
from urllib.request import urlopen
from bs4 import BeautifulSoup
page = urlopen(url)
soup = BeautifulSoup(page)
reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
print(review.find("div", {"class": "review_item_header_content"}).text)
从网站上看(见截图),前两个标题应该是 "Sencillamente placentera" 和 "It could have been great."。然而,不知何故 url 只加载西班牙语评论:
“Sencillamente placentera”
“La atención de la chica del restaurante”
“El desayuno estilo buffet, completo”
“我对 ubicación 充满热情,对 la vista 充满热情。”
“Su ubicación es muy buena。”
我注意到如果在 url 中将 'museo.es.' 更改为 'museo.en.',我会得到 headers 的英文评论。但这是不一致的,因为如果我加载原始 url,我会收到英语、法语、西班牙语等的评论。我该如何解决这个问题?谢谢
你总是可以使用浏览器作为 B 计划。Selenium 没有这个问题
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75')
titles = [item.text for item in d.find_elements_by_css_selector('.review_item_review_header [itemprop=name]')]
print(titles)
服务器可以配置为根据发出请求的浏览器发送不同的响应。添加 User-Agent
似乎可以解决问题。
import urllib.request
from bs4 import BeautifulSoup
url='https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75'
req = urllib.request.Request(
url,
headers={
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
}
)
f = urllib.request.urlopen(req)
soup = BeautifulSoup(f.read().decode('utf-8'),'html.parser')
reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
print(review.find("div", {"class": "review_item_header_content"}).text)
输出:
“Sencillamente placentera”
“It could had been great.”
“will never stay their in the future.”
“Hôtel bien situé.”
...
访问 Booking.com 评论的新方法是使用新的 reviewlist.html
端点。例如,原始问题中的酒店评论位于:
这个端点特别棒,因为它支持许多过滤器并且每页最多提供 25 条评论。
这是 Python 中的一个片段 parsel
和 httpx
:
def parse_reviews(html: str) -> List[dict]:
"""parse review page for review data """
sel = Selector(text=html)
parsed = []
for review_box in sel.css('.review_list_new_item_block'):
get_css = lambda css: review_box.css(css).get("").strip()
parsed.append({
"id": review_box.xpath('@data-review-url').get(),
"score": get_css('.bui-review-score__badge::text'),
"title": get_css('.c-review-block__title::text'),
"date": get_css('.c-review-block__date::text'),
"user_name": get_css('.bui-avatar-block__title::text'),
"user_country": get_css('.bui-avatar-block__subtitle::text'),
"text": ''.join(review_box.css('.c-review__body ::text').getall()),
"lang": review_box.css('.c-review__body::attr(lang)').get(),
})
return parsed
async def scrape_reviews(hotel_id: str, session) -> List[dict]:
"""scrape all reviews of a hotel"""
async def scrape_page(page, page_size=25): # 25 is largest possible page size for this endpoint
url = "https://www.booking.com/reviewlist.html?" + urlencode(
{
"type": "total",
# we can configure language preference
"lang": "en-us",
# we can configure sorting order here, in this case recent reviews are first
"sort": "f_recent_desc",
"cc1": "gb", # this varies by hotel country, e.g in OP's case it would be "co" for columbia.
"dist": 1,
"pagename": hotel_id,
"rows": page_size,
"offset": page * page_size,
}
)
return await session.get(url)
first_page = await scrape_page(1)
total_pages = Selector(text=first_page.text).css(".bui-pagination__link::attr(data-page-number)").getall()
total_pages = max(int(page) for page in total_pages)
other_pages = await asyncio.gather(*[scrape_page(i) for i in range(2, total_pages + 1)])
results = []
for response in [first_page, *other_pages]:
results.extend(parse_reviews(response.text))
return results
我在我的博客 How to Scrape Booking.com 上写了更多关于抓取此端点的内容,如果需要更多信息,其中有更多插图和视频。
我正在尝试从该网站获取 Booking.com 条评论的标题:
https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75,
其中 r_lang=all
基本上是说网站应该显示所有语言的评论。
为了从此页面获取标题,我这样做了:
from urllib.request import urlopen
from bs4 import BeautifulSoup
page = urlopen(url)
soup = BeautifulSoup(page)
reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
print(review.find("div", {"class": "review_item_header_content"}).text)
从网站上看(见截图),前两个标题应该是 "Sencillamente placentera" 和 "It could have been great."。然而,不知何故 url 只加载西班牙语评论: “Sencillamente placentera”
“La atención de la chica del restaurante”
“El desayuno estilo buffet, completo”
“我对 ubicación 充满热情,对 la vista 充满热情。”
“Su ubicación es muy buena。”
我注意到如果在 url 中将 'museo.es.' 更改为 'museo.en.',我会得到 headers 的英文评论。但这是不一致的,因为如果我加载原始 url,我会收到英语、法语、西班牙语等的评论。我该如何解决这个问题?谢谢
你总是可以使用浏览器作为 B 计划。Selenium 没有这个问题
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75')
titles = [item.text for item in d.find_elements_by_css_selector('.review_item_review_header [itemprop=name]')]
print(titles)
服务器可以配置为根据发出请求的浏览器发送不同的响应。添加 User-Agent
似乎可以解决问题。
import urllib.request
from bs4 import BeautifulSoup
url='https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75'
req = urllib.request.Request(
url,
headers={
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
}
)
f = urllib.request.urlopen(req)
soup = BeautifulSoup(f.read().decode('utf-8'),'html.parser')
reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
print(review.find("div", {"class": "review_item_header_content"}).text)
输出:
“Sencillamente placentera”
“It could had been great.”
“will never stay their in the future.”
“Hôtel bien situé.”
...
访问 Booking.com 评论的新方法是使用新的 reviewlist.html
端点。例如,原始问题中的酒店评论位于:
这个端点特别棒,因为它支持许多过滤器并且每页最多提供 25 条评论。
这是 Python 中的一个片段 parsel
和 httpx
:
def parse_reviews(html: str) -> List[dict]:
"""parse review page for review data """
sel = Selector(text=html)
parsed = []
for review_box in sel.css('.review_list_new_item_block'):
get_css = lambda css: review_box.css(css).get("").strip()
parsed.append({
"id": review_box.xpath('@data-review-url').get(),
"score": get_css('.bui-review-score__badge::text'),
"title": get_css('.c-review-block__title::text'),
"date": get_css('.c-review-block__date::text'),
"user_name": get_css('.bui-avatar-block__title::text'),
"user_country": get_css('.bui-avatar-block__subtitle::text'),
"text": ''.join(review_box.css('.c-review__body ::text').getall()),
"lang": review_box.css('.c-review__body::attr(lang)').get(),
})
return parsed
async def scrape_reviews(hotel_id: str, session) -> List[dict]:
"""scrape all reviews of a hotel"""
async def scrape_page(page, page_size=25): # 25 is largest possible page size for this endpoint
url = "https://www.booking.com/reviewlist.html?" + urlencode(
{
"type": "total",
# we can configure language preference
"lang": "en-us",
# we can configure sorting order here, in this case recent reviews are first
"sort": "f_recent_desc",
"cc1": "gb", # this varies by hotel country, e.g in OP's case it would be "co" for columbia.
"dist": 1,
"pagename": hotel_id,
"rows": page_size,
"offset": page * page_size,
}
)
return await session.get(url)
first_page = await scrape_page(1)
total_pages = Selector(text=first_page.text).css(".bui-pagination__link::attr(data-page-number)").getall()
total_pages = max(int(page) for page in total_pages)
other_pages = await asyncio.gather(*[scrape_page(i) for i in range(2, total_pages + 1)])
results = []
for response in [first_page, *other_pages]:
results.extend(parse_reviews(response.text))
return results
我在我的博客 How to Scrape Booking.com 上写了更多关于抓取此端点的内容,如果需要更多信息,其中有更多插图和视频。