Web scraping in R with rvest: return NA if a div is missing

Question

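I'm trying to scrape four divs from each tooltip: shoot type, blow type, time, and shoot xG. Sometimes one of the divs is missing. For example, [2] below has no "tooltip-shoot-xg" div.

How can I loop over each div.tooltip and return NA whenever one of the four components is missing?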

[1] "<div class=\"tooltip\" style=\"left: 37.5%; top: 36.5789%;\">\n<div class=\"tooltip-title\">\n<div class=\"tooltip-shoot-type\">Shot blocked</div>\n<div class=\"tooltip-blow-type\">Smith </div>\n<div class=\"tooltip-shoot-name\"></div>\n</div>\n<div class=\"tooltip-time\">a </div>\n<div class=\"tooltip-time\">Half 1, 09:18 28/01/18</div>\n<div class=\"tooltip-shoot-xg\">Expected goals: 0.09</div>\n</div>"

[2] "<div class=\"tooltip\" style=\"left: 54.7059%; top: 11.0526%;\">\n<div class=\"tooltip-title\">\n<div class=\"tooltip-shoot-type\">Own goal</div>\n<div class=\"tooltip-blow-type\">Johnson </div>\n<div class=\"tooltip-shoot-name\"></div>\n</div>\n<div class=\"tooltip-time\">h </div>\n<div class=\"tooltip-time\">Half 1, 14:36 28/01/18</div>\n</div>"

The output above is the result of:

pg %>% 
  html_nodes("div.tooltip")

Answer 1

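If you use an XPath selector, a leading dot (.) stands for the current node, so you can look up the nested divs relative to it. For this example, the code looks like this: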

divs <- pg %>% html_nodes("div.tooltip")
for (i in seq_along(divs)) {
  # R names cannot contain hyphens, so use underscores; ".//" also matches divs nested in tooltip-title
  shoot_type <- divs[[i]] %>% html_node(xpath = ".//div[@class='tooltip-shoot-type']") %>% html_text()
  blow_type <- divs[[i]] %>% html_node(xpath = ".//div[@class='tooltip-blow-type']") %>% html_text()
  time <- divs[[i]] %>% html_node(xpath = ".//div[@class='tooltip-time']") %>% html_text()
  shoot_xg <- divs[[i]] %>% html_node(xpath = ".//div[@class='tooltip-shoot-xg']") %>% html_text()
  # add code here to save data
}

If a tooltip has no div with class tooltip-shoot-xg, html_node() returns a missing node and html_text() yields NA for the xG value.
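The same idea can probably be written without an explicit loop, because html_node() applied to a whole node set returns exactly one result per tooltip, with a missing node where nothing matches. The sketch below is only an illustration under some assumptions: the HTML is a trimmed copy of the two tooltips from the question, first_text() is a hypothetical helper introduced here, and picking the second tooltip-time div as the timestamp is a guess about which of the two values is wanted.

```r
library(rvest)

# Two tooltips modelled on the question; the second has no tooltip-shoot-xg div.
html <- '
<div class="tooltip">
  <div class="tooltip-title">
    <div class="tooltip-shoot-type">Shot blocked</div>
    <div class="tooltip-blow-type">Smith </div>
  </div>
  <div class="tooltip-time">a </div>
  <div class="tooltip-time">Half 1, 09:18 28/01/18</div>
  <div class="tooltip-shoot-xg">Expected goals: 0.09</div>
</div>
<div class="tooltip">
  <div class="tooltip-title">
    <div class="tooltip-shoot-type">Own goal</div>
    <div class="tooltip-blow-type">Johnson </div>
  </div>
  <div class="tooltip-time">h </div>
  <div class="tooltip-time">Half 1, 14:36 28/01/18</div>
</div>'

pg <- read_html(html)
divs <- html_nodes(pg, "div.tooltip")

# Hypothetical helper: trimmed text of the first match inside each tooltip,
# or NA when a tooltip has no matching div.
first_text <- function(nodes, xpath) {
  trimws(html_text(html_node(nodes, xpath = xpath)))
}

shots <- data.frame(
  shoot_type = first_text(divs, ".//div[@class='tooltip-shoot-type']"),
  blow_type  = first_text(divs, ".//div[@class='tooltip-blow-type']"),
  # Assumption: the second tooltip-time div holds the timestamp.
  time       = first_text(divs, ".//div[@class='tooltip-time'][2]"),
  shoot_xg   = first_text(divs, ".//div[@class='tooltip-shoot-xg']"),
  stringsAsFactors = FALSE
)
print(shots)
```

The data frame has one row per tooltip, and the shoot_xg column is NA for the own goal. Note that html_nodes() (plural) would instead drop the missing entry and return a shorter vector, which is why html_node() (singular) is used per tooltip.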

