rvest: language selection not working in TripAdvisor
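I'm running into a web scraping problem. I want to collect reviews from TripAdvisor using rvest, and I want the reviews in all languages. From this question I understood that one possible way is to append ?filterLang=ALL to the end of the URL.
In a web browser this does work: for example, the URL used in the code below shows the reviews with "All languages" selected (you can see many French reviews). Here is my problem. I try to get the review titles: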
library(rvest)
url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL"
reviews_html <- read_html(url)
reviews_html %>%
html_nodes(xpath = "//span[@class='noQuotes']") %>%
html_text()
[1] "I've never visited this restaurant," "Perfect"
[3] "Memorable experience" "Tasty"
[5] "Absolutely spectacular" "Excellent"
[7] "Wonderfullll" "A Perfect Evening"
[9] "Dinner " "Perfect dinner and evening"
I only get the English ones. The strange thing is that if I try to get the number of pages:
reviews_html %>%
html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]")%>%
html_text()
[1] "Next" "1" "2" "3" "4" "5" "6" "176"
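I get the number of review pages that corresponds to the "All languages" option! Compare this with the case where no language is selected: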
url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html"
reviews_html <- read_html(url)
reviews_html %>%
html_nodes(xpath = "//span[@class='noQuotes']") %>%
html_text()
[1] "I've never visited this restaurant," "Perfect"
[3] "Memorable experience" "Tasty"
[5] "Absolutely spectacular" "Excellent"
[7] "Wonderfullll" "A Perfect Evening"
[9] "Dinner " "Perfect dinner and evening"
I get the same reviews, but:
reviews_html %>%
html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]")%>%
html_text()
[1] "Next" "1" "2" "3" "4" "5" "6" "61"
This time I get the number of pages that corresponds to the English language selection.
I also tried setting cookies:
library(httr)
url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL"
httr::GET(url,
set_cookies(`TALanguage` = "ALL",
`Domain` = ".tripadvisor.com"))%>%
read_html()%>%
html_nodes(xpath = "//span[@class='noQuotes']") %>%
html_text()
But that did not work either.
Does anyone know what is going on, and what I can do to get the reviews in all languages with rvest?
When you select the filter manually in the browser, a POST request is sent to the same URL. Setting filterLang=ALL in the form body of that request returns the data:
library(rvest)
library(httr)
reviews_html <- POST(
"https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html",
add_headers('x-requested-with'= 'XMLHttpRequest'),
body = list(
preferFriendReviews = "FALSE",
t = "",
q = "", # filter by mention, try "france"
filterSeasons = "", # "1" is mar-may / "2" is jun-aug / "3" is sep-nov / "4" is dec-feb
filterLang = "ALL", # try "zhCN" or "fr"
filterSafety = "FALSE",
filterSegment = "", # "3" is families / "2" is couples / "5" is solo / "1" is business / "4" is friends
trating = "", # stars: "5" / "4" / "3" / "2" / "1" / "0"
isLastPoll = "false",
changeSet = "REVIEW_LIST"
),
encode = "form") %>%
read_html()
reviews <- reviews_html %>%
html_nodes(xpath = "//span[@class='noQuotes']") %>%
html_text()
print(reviews)
pages <- reviews_html %>%
html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]")%>%
html_text()
print(pages)
In the code above I added comments describing the form fields, in case you need those filters.
Output:
[1] "I've never visited this restaurant," "Excellente expérience"
[3] "Du grand art" "Promesse tenue"
[5] "Une soirée de rêve en famille" "Délicieux !!! "
[7] "Une expérience inoubliable" "UN CERTAIN REGARD"
[9] "Excellent soiree en couple" "Une soirée magnifique"
[1] "Next" "1" "2" "3" "4" "5" "6" "176"
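If you need to try several filter combinations, the POST above can be wrapped in a small function. The following is only a sketch: build_review_form and fetch_review_titles are hypothetical helper names that do not come from rvest or httr, and the form fields and XPath are copied from the code above on the assumption that the site still accepts them.

```r
# Hypothetical helper: builds the same form body as the POST above,
# with the language and star rating exposed as arguments.
build_review_form <- function(lang = "ALL", rating = "") {
  list(
    preferFriendReviews = "FALSE",
    t = "",
    q = "",
    filterSeasons = "",
    filterLang = lang,      # "ALL", "fr", "zhCN", ...
    filterSafety = "FALSE",
    filterSegment = "",
    trating = rating,       # "" for any rating, or "5" ... "0"
    isLastPoll = "false",
    changeSet = "REVIEW_LIST"
  )
}

# Hypothetical wrapper: sends the form and returns the review titles.
# Requires the httr, xml2 and rvest packages when it is called.
fetch_review_titles <- function(url, lang = "ALL", rating = "") {
  resp <- httr::POST(
    url,
    httr::add_headers(`x-requested-with` = "XMLHttpRequest"),
    body = build_review_form(lang, rating),
    encode = "form"
  )
  httr::stop_for_status(resp)  # fail early on an HTTP error
  page <- xml2::read_html(httr::content(resp, as = "text", encoding = "UTF-8"))
  rvest::html_text(
    rvest::html_nodes(page, xpath = "//span[@class='noQuotes']")
  )
}

# Example call (performs a network request, so it is left commented out):
# url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html"
# fetch_review_titles(url, lang = "fr", rating = "5")
```

Keeping the form construction in its own function makes it easy to inspect the body that will be sent before any request is made.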