What syntax error is causing this specific error message?

Question

I'm using R in RStudio, and I have an R script that performs web scraping. When I run these particular lines, I get an error message:

      review<-ta1 %>%
              html_node("body") %>%
              xml_find_all("//div[contains@class,'location-review-review']")

The error message is as follows:

xmlXPathEval: evaluation failed
Error in `*tmp*` - review : non-numeric argument to binary operator
In addition: Warning message:
In xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
  Invalid predicate [1206]

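Note: the dplyr and rvest libraries are loaded in my R script.

I looked at the answers to the following question on Stack Overflow: Non-numeric argument to binary operator

I think my solution is related to the answer Richard Border gave to the question linked above. However, I'm having a hard time working out how to correct my R syntax based on that answer.

Thank you for looking into my question.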

Sample of ta1 added:

{xml_document}
<html lang="en" xmlns:og="http://opengraphprotocol.org/schema/">
[1] <head>\n<meta http-equiv="content-type" content="text/html; charset=utf-8">\n<link rel="icon" id="favicon"  ...
[2] <body class="rebrand_2017 desktop_web Hotel_Review  js_logging" id="BODY_BLOCK_JQUERY_REFLOW" data-tab="TAB ...

Answer 1

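I'm going to make some assumptions here, since your post doesn't contain enough information to produce a reproducible example.

First, I'm guessing you are trying to scrape TripAdvisor, since the id and class fields match that site and your variable is named ta1.

Second, I'm assuming you want the text of each review and the star rating of each review, since that is the relevant scrapable information your classes appear to be targeting.

I need to start by getting my own version of the ta1 variable, since it can't be reproduced from your edited version.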

library(httr)
library(rvest)
library(xml2)
library(magrittr)
library(tibble)

"https://www.tripadvisor.co.uk/"                          %>% 
paste0("Hotel_Review-g186534-d192422-Reviews-")           %>%
paste0("Glasgow_Marriott_Hotel-Glasgow_Scotland.html") -> url

ta1 <- url %>% GET %>% read_html

Now write the correct XPaths for the data of interest:

# xpath for elements whose text contains reviews
xpath1 <- "//div[contains(@class, 'location-review-review-list-parts-Expand')]"

# xpath for the elements whose class indicate the ratings
xpath2 <- "//div[contains(@class, 'location-review-review-')]"
xpath3 <- "/span[contains(@class, 'ui_bubble_rating bubble_')]"
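The difference from the XPath in the question is that `contains` is a function, so its arguments go in parentheses: `contains(@class, '...')` rather than `contains@class,'...'`. A minimal sketch of the two forms side by side, run against a small made-up HTML snippet (the markup and text here are hypothetical, not real TripAdvisor output):

```r
library(xml2)

# Hypothetical stand-in for the scraped page
doc <- read_html('<html><body>
  <div class="location-review-review-list-parts-Expand">Great stay</div>
  <div class="something-else">Not a review</div>
</body></html>')

# Form used in the question: no parentheses after contains, so the
# predicate cannot be parsed. Capture whatever condition is signalled.
bad <- "//div[contains@class,'location-review-review']"
bad_result <- tryCatch(
  xml_find_all(doc, bad),
  warning = function(w) conditionMessage(w),
  error   = function(e) conditionMessage(e)
)

# Corrected form: contains() called with two arguments
good <- "//div[contains(@class, 'location-review-review')]"
hits <- xml_find_all(doc, good)
xml_text(hits)
```

Only the corrected expression selects the review `div`; the malformed one is rejected by the XPath parser before any matching happens.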

We can get the text of the reviews like this:

ta1                                             %>% 
xml_find_all(xpath1)                            %>% # run first query
html_text()                                     %>% # extract text
extract(!equals(., "Read more")) -> reviews         # remove "blank" reviews
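In the pipeline above, magrittr's `extract` is an alias for `[` and `equals` is an alias for `==`, so the last step is ordinary logical subsetting. A rough base-R equivalent, using made-up review strings as an assumption about what `html_text()` might return:

```r
# Hypothetical text values; "Read more" stands for the expand-link placeholder
texts <- c("Lovely hotel", "Read more", "Noisy room")

# Keep every element that is not exactly "Read more"
reviews <- texts[texts != "Read more"]
reviews
```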

And the associated star ratings like this:

ta1 %>% 
xml_find_all(paste0(xpath2, xpath3)) %>% 
xml_attr("class")                    %>% 
strsplit("_")                        %>%
lapply(function(x) x[length(x)])     %>% 
as.numeric                           %>% 
divide_by(10)                         -> stars
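The rating is encoded in the class name, so the pipeline splits each class string on underscores, keeps the last piece, and divides by 10. A small base-R sketch of that parsing step on its own, with hypothetical class values modelled on the `ui_bubble_rating bubble_` pattern targeted by the XPath:

```r
# Hypothetical class attribute values
classes <- c("ui_bubble_rating bubble_10", "ui_bubble_rating bubble_45")

# Split on "_" and keep the trailing piece of each string
parts <- strsplit(classes, "_")
last  <- vapply(parts, function(x) x[length(x)], character(1))

# "10" becomes 1, "45" becomes 4.5
stars <- as.numeric(last) / 10
stars
```

Using `vapply` instead of `lapply` returns a character vector directly, which avoids relying on `as.numeric` to flatten a list.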

Our result looks like this:

tibble(rating = stars, review = reviews)
## A tibble: 5 x 2
#  rating review                                                                                             
#   <dbl> <chr>                                                                                              
#1      1 7 of us attended the Christmas Party on Satu~
#2      4 "We stayed 2 nights over last weekend to att~
#3      3 Had a good stay, but had no provision to kee~
#4      3 Booked an overnight for a Christmas shopping~
#5      4 Attended a charity lunch here on Friday and ~


r

web-scraping

dplyr

rvest