rvest: language selection not working on TripAdvisor

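I have a web-scraping problem. I want to collect reviews from TripAdvisor with rvest, and I want the reviews in all languages. From this question I learned that one possible approach is to append ?filterLang=ALL to the URL. In a web browser, this does work. Example: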

https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL

does return the reviews with "All languages" selected (you can see many French reviews). Here is my problem. When I try to get the review titles:

library(rvest)
url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL"

reviews_html <- read_html(url)

reviews_html %>%
  html_nodes(xpath = "//span[@class='noQuotes']") %>%
  html_text()

 [1] "I've never visited this restaurant," "Perfect"                            
 [3] "Memorable experience"                "Tasty"                              
 [5] "Absolutely spectacular"              "Excellent"                          
 [7] "Wonderfullll"                        "A Perfect Evening"                  
 [9] "Dinner "                             "Perfect dinner and evening" 

I only get the English ones. The strange part is that if I try to get the page numbers:

reviews_html %>%
  html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]")%>%
  html_text()

[1] "Next" "1"    "2"    "3"    "4"    "5"    "6"    "176"

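I get the number of review pages that corresponds to the "All languages" option! Compare this with the case where no language is selected: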

url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html"

reviews_html <- read_html(url)

reviews_html %>%
  html_nodes(xpath = "//span[@class='noQuotes']") %>%
  html_text()

 [1] "I've never visited this restaurant," "Perfect"                            
 [3] "Memorable experience"                "Tasty"                              
 [5] "Absolutely spectacular"              "Excellent"                          
 [7] "Wonderfullll"                        "A Perfect Evening"                  
 [9] "Dinner "                             "Perfect dinner and evening" 

I get the same reviews, but:

reviews_html %>%
  html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]")%>%
  html_text()

[1] "Next" "1"    "2"    "3"    "4"    "5"    "6"    "61" 

I get the page count that corresponds to the English language selection. I also tried setting cookies:

library(httr)

url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL"
httr::GET(url, 
          set_cookies(`TALanguage` = "ALL",
                      `Domain` = ".tripadvisor.com"))%>%
  read_html()%>%
  html_nodes(xpath = "//span[@class='noQuotes']") %>%
  html_text()

But that did not work either. Does anyone know what is going on, and what I can do to get the reviews in all languages with rvest?

When you select the filter manually in the browser, the page sends a POST to the same URL. Sending filterLang=ALL in the form body of that POST returns the data:

library(rvest)
library(httr)

reviews_html <- POST(
  "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html",
  add_headers("x-requested-with" = "XMLHttpRequest"),
  body = list(
    preferFriendReviews = "FALSE",
    t = "",
    q = "", # filter by mention, try "france"
    filterSeasons = "", # "1" is mar-may / "2" is jun-aug / "3" is sep-nov / "4" is dec-feb
    filterLang = "ALL", # try "zhCN" or "fr"
    filterSafety = "FALSE",
    filterSegment = "", # "3" is families / "2" is couples / "5" is solo / "1" is business / "4" is friends
    trating = "", # stars: "5" / "4" / "3" / "2" / "1" / "0"
    isLastPoll = "false",
    changeSet = "REVIEW_LIST"
  ),
  encode = "form"
) %>%
  read_html()

reviews <- reviews_html %>%
  html_nodes(xpath = "//span[@class='noQuotes']") %>%
  html_text()

print(reviews)

pages  <- reviews_html %>%
  html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]")%>%
  html_text()

print(pages)

In the code above I added short descriptions of the form fields, in case you need those filters.
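
If you want to try several language filters, the request above could be wrapped in small helpers. This is only a sketch: the function names `review_filter_body` and `fetch_review_titles` are made up for illustration, the form fields are copied from the request above, and it assumes httr and rvest are installed and that the site still accepts this POST.

```r
# Hypothetical helper: rebuilds the form body from the request above,
# changing only the language filter ("ALL", "fr", "zhCN", ...).
review_filter_body <- function(lang = "ALL") {
  list(
    preferFriendReviews = "FALSE",
    t = "",
    q = "",
    filterSeasons = "",
    filterLang = lang,
    filterSafety = "FALSE",
    filterSegment = "",
    trating = "",
    isLastPoll = "false",
    changeSet = "REVIEW_LIST"
  )
}

# Hypothetical helper: sends the POST and returns the review titles.
# Namespaced calls are used so that nothing has to be attached up front.
fetch_review_titles <- function(url, lang = "ALL") {
  resp <- httr::POST(
    url,
    httr::add_headers("x-requested-with" = "XMLHttpRequest"),
    body = review_filter_body(lang),
    encode = "form"
  )
  httr::stop_for_status(resp) # fail loudly on HTTP errors
  page <- rvest::read_html(resp)
  nodes <- rvest::html_nodes(page, xpath = "//span[@class='noQuotes']")
  rvest::html_text(nodes)
}
```

With the restaurant URL from above stored in `url`, `fetch_review_titles(url, "fr")` would then be the call to try for French reviews only.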


Output:

 [1] "I've never visited this restaurant," "Excellente expérience"              
 [3] "Du grand art"                        "Promesse tenue"                     
 [5] "Une soirée de rêve en famille"       "Délicieux !!! "                     
 [7] "Une expérience inoubliable"          "UN CERTAIN REGARD"                  
 [9] "Excellent soiree en couple"          "Une soirée magnifique"              
[1] "Next" "1"    "2"    "3"    "4"    "5"    "6"    "176"
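
The pagination labels can also be reduced to a single number, which makes it easy to check which filter was applied (176 pages for all languages versus 61 for English only). A minimal sketch, where `last_page_number` is a hypothetical helper name and `pages` is hard-coded to the labels printed above so the snippet runs without a network request:

```r
# Page labels as printed in the output above.
pages <- c("Next", "1", "2", "3", "4", "5", "6", "176")

# Hypothetical helper: keep the purely numeric labels and return the largest.
# Returns NA when no numeric label is present (e.g. a single-page listing).
last_page_number <- function(labels) {
  numeric_labels <- labels[grepl("^[0-9]+$", labels)]
  if (length(numeric_labels) == 0) {
    return(NA_integer_)
  }
  max(as.integer(numeric_labels))
}

last_page_number(pages)
# [1] 176
```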