Separate strings returned by html_text in rvest
I am trying to extract a hotel's amenities using rvest.
library(rvest)
hotel_url <- "https://www.tripadvisor.com/Hotel_Review-g187791-d13494726-Reviews-Palazzo_Caruso-Rome_Lazio.html"
# parse the page first; `hotel` was otherwise undefined
hotel <- read_html(hotel_url)
amenities <- hotel %>%
  html_node(".hotels-hr-about-amenities-AmenityGroup__amenitiesList--3MdFn") %>%
  html_text()
The resulting text does not separate one amenity from the next:
[1] "Paid private parking nearbyFree High Speed Internet (WiFi)Coffee shopBicycle toursWalking toursCar hireFax / photocopyingBaggage storageFree internetWifiPublic wifiInternetBreakfast availableBreakfast in the roomConciergeExecutive lounge accessNon-smoking hotelSun terrace24-hour front deskPrivate check-in / check-outLaundry service"
Is there a way to add a separator (for example ";") between the amenities?
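You need to go one or two levels deeper into the HTML structure to pull the text out as a list. The html_children() function can do this.
See the comments in the code for details: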
library(rvest)
library(xml2)  # xml_length() comes from xml2; attach it explicitly
hotel_url <- "https://www.tripadvisor.com/Hotel_Review-g187791-d13494726-Reviews-Palazzo_Caruso-Rome_Lazio.html"
hotel <- read_html(hotel_url)
# select the amenities list, then step down to its direct children
amenities <- hotel %>%
  html_node(".hotels-hr-about-amenities-AmenityGroup__amenitiesList--3MdFn") %>%
  html_children()
# the last child node holds the unhighlighted amenities
# get the text of the highlighted amenities (nodes with exactly one child)
highlighted <- amenities[xml_length(amenities) == 1] %>% html_text()
# drill down one more level for the unhighlighted amenities
unhighlighted <- amenities[xml_length(amenities) > 1] %>% html_children() %>% html_text()
> highlighted
[1] "Paid private parking nearby" "Free High Speed Internet (WiFi)" "Coffee shop" "Bicycle tours"
[5] "Walking tours" "Car hire" "Fax / photocopying" "Baggage storage"
> unhighlighted
[1] "Free internet" "Wifi" "Public wifi" "Internet"
[5] "Breakfast available" "Breakfast in the room" "Concierge" "Executive lounge access"
[9] "Non-smoking hotel" "Sun terrace" "24-hour front desk" "Private check-in / check-out"
[13] "Laundry service"
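To get the single ";"-separated string the question asks for, the two vectors can be combined and collapsed with paste(). The sketch below is only an illustration: because the live page and its class names may change, it runs against a small inline HTML snippet that imitates the structure described above. The `amenities` and `more` class names and the four amenity values are assumptions made up for the example, not taken from the real page.

```r
library(rvest)
library(xml2)  # for xml_length()

# Hypothetical markup imitating the page: each highlighted amenity is a
# child with one nested element; the last child wraps the remaining ones.
page <- read_html(paste0(
  '<div class="amenities">',
  '<div><span>Coffee shop</span></div>',
  '<div><span>Car hire</span></div>',
  '<div class="more"><div>Wifi</div><div>Concierge</div></div>',
  '</div>'
))

# direct children of the (hypothetical) amenities container
amenities <- page %>%
  html_node(".amenities") %>%
  html_children()

# children with exactly one nested element are the highlighted amenities
highlighted <- amenities[xml_length(amenities) == 1] %>% html_text()
# the wrapper has several children; go one level deeper for their text
unhighlighted <- amenities[xml_length(amenities) > 1] %>%
  html_children() %>%
  html_text()

# join everything into one string with "; " between amenities
joined <- paste(c(highlighted, unhighlighted), collapse = "; ")
print(joined)
```

Keeping the amenities as a character vector until the last step leaves the choice of separator open; only the collapse argument needs to change.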