Rvest: navigating html nodes

Question

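I am trying to scrape results data from HLTV.org: my goal is to scrape the winner, loser and date of each match.

I have successfully scraped the winner and loser of each match; the structure of the date is where I am struggling. Each day's results are held within a <div class="results-sublist">. The value of the date is held in a span nested within that div (note: Chrome tells me this is a div; this is a lie):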

 <div class="standard-headline">Results for December 28th 2020</div>

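At the same 'level' as this span (i.e. directly beneath the 'results-sublist' div), we have the results of each match on that day.

Before looping over each day, I am trying to create a table from a single day's results.

My current code is below: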

url <- "https://www.hltv.org/results"
s <- rvest::html_session(url)
s_tree <- xml2::read_html(s)


day_results_nodes <- s_tree %>%
 html_nodes(xpath="//div[contains(@class, 'results-sublist')]")

i <- day_results_nodes[[1]]

date <- i %>%
  html_nodes(xpath = "//span[contains(@class, 'standard-headline')]") %>%
  html_text()

winner <- i %>%
 xml2::xml_find_all("//div[contains(@class, 'team team-won')]") %>%
 rvest::html_text()

loser <- i %>%
 xml2::xml_find_all(xpath="//div[@class='team ']") %>%
 rvest::html_text()

page_results <- cbind(winner, loser, x1 = date)

page_results (actual return)

	winner	loser	x1
1	ex-ETHEREAL	Lyngby Vikings	Results for December 28th 2020
2	Winstrike	MBAPPEEK	Results for December 27th 2020
3	Project X	Lilmix	Results for December 25th 2020
4	Budapest Five	Lilmix	Results for December 24th 2020
5	Project X	The Incas	Results for December 23rd 2020
...	...	...	...
100	Movistar Riders	Endpoint	Results for December 28th 2020

page_results (expected)

	winner	loser	x1
1	ex-ETHEREAL	Lyngby Vikings	Results for December 28th 2020
2	Winstrike	MBAPPEEK	Results for December 28th 2020
3	Project X	Lilmix	Results for December 28th 2020

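Positives: day_results_nodes behaves as expected. It returns 11 html nodes, each with as many divs beneath it as there were matches on that day.

Negatives: page_results returns a table of all 100 results on the web page, rather than the four matches attached to the first 'results-sublist' div. The date column simply cycles through the 11 dates available on the page, rather than holding the date from the 'standard-headline' that each match belongs to (I want the single date value to be broadcast across every row in that day's 'results-sublist').

I receive the following warning message: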

Warning in cbind(winner, loser, x1 = date) : number of rows of result is not a multiple of vector length (arg 3)

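I believe this is a by-product of the broadcast not working as expected.

I am unclear why i <- day_results_nodes[[1]] does not appear to reference the first node (i.e. the most recent day's results data); printing i returns, as expected, one html node and three class nodes. This leads me to believe that my error lies in xml_find_all(), though I do not understand why.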

Answer 1

xml_find_all is starting from the root node because you have not added the leading . to your xpath. I show the correction at the end.

From the documentation:

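You want the right number of nodes in each group (e.g. 1 date alongside equal numbers of winners and losers) so that recycling of the date fills the column correctly. Each block should return one date, so you only need html_node. For the winners and losers, I would switch to css, as this returns the correct number of nodes relative to the current node (from your index). Using css is also faster. To get the right number of losers, I use a :not pseudo-class selector to exclude the winners' nodes, which have a multi-valued class that includes team-won.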

library(rvest)
library(magrittr)

url <- "https://www.hltv.org/results"
s <- rvest::html_session(url)
s_tree <- xml2::read_html(s)

day_results_nodes <- s_tree %>%
  html_nodes('.results-sublist')

i <- day_results_nodes[[1]]

date <- i %>%
  html_node('.standard-headline') %>%
  html_text()

winner <- i %>%
  html_nodes('.team-won') %>%
  html_text()
   
loser <- i %>%
  html_nodes('.team:not(.team-won)') %>%
  html_text()

page_results <- cbind(winner, loser, x1 = date)
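Since the stated goal is to loop over every day, here is a minimal sketch of how the single-day approach might be extended. It runs against a small inline fragment rather than the live site; the markup, including the results-all and result wrapper divs and the team names, is a hypothetical simplification and not a copy of the real HLTV page.

```r
library(rvest)

# Hypothetical, simplified markup standing in for the real page
html <- '
<div class="results-all">
  <div class="results-sublist">
    <span class="standard-headline">Results for December 28th 2020</span>
    <div class="result"><div class="team team-won">A</div><div class="team ">B</div></div>
    <div class="result"><div class="team ">C</div><div class="team team-won">D</div></div>
  </div>
  <div class="results-sublist">
    <span class="standard-headline">Results for December 27th 2020</span>
    <div class="result"><div class="team team-won">E</div><div class="team ">F</div></div>
  </div>
</div>'

page <- xml2::read_html(html)
day_nodes <- html_nodes(page, ".results-sublist")

# One data frame per day; the single date is recycled across that day's rows
per_day <- lapply(day_nodes, function(day) {
  data.frame(
    winner = html_text(html_nodes(day, ".team-won")),
    loser  = html_text(html_nodes(day, ".team:not(.team-won)")),
    date   = html_text(html_node(day, ".standard-headline")),
    stringsAsFactors = FALSE
  )
})

# Stack the per-day tables into one
all_results <- do.call(rbind, per_day)
print(all_results)
```

Because each css selector is evaluated relative to the day node passed in, every per-day table holds only that day's matches, and the date column always has a length that divides the number of rows.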

xpath

library(rvest)
library(magrittr)

url <- "https://www.hltv.org/results"
s <- rvest::html_session(url)
s_tree <- xml2::read_html(s)


day_results_nodes <- s_tree %>%
  html_nodes(xpath="//div[contains(@class, 'results-sublist')]")

i <- day_results_nodes[[1]]

date <- i %>%
  html_node(xpath = ".//span[contains(@class, 'standard-headline')]") %>%
  html_text()

winner <- i %>%
  xml2::xml_find_all(".//div[contains(@class, 'team team-won')]") %>%
  rvest::html_text()

loser <- i %>%
  xml2::xml_find_all(xpath=".//div[@class='team ']") %>%
  rvest::html_text()

page_results <- cbind(winner, loser, x1 = date)

page_results
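The effect of the leading . can be seen in isolation on a tiny inline document. This is an illustrative sketch only; the two-day markup and the "Day 1"/"Day 2" labels are made up for the example.

```r
library(xml2)

# Made-up two-day document, used only to contrast the two xpath forms
doc <- read_html('
<div class="results-sublist"><span class="standard-headline">Day 1</span></div>
<div class="results-sublist"><span class="standard-headline">Day 2</span></div>')

first_day <- xml_find_all(doc, "//div[contains(@class, 'results-sublist')]")[[1]]

# "//span" is absolute: it searches the whole document, ignoring first_day
absolute <- xml_text(xml_find_all(first_day, "//span"))

# ".//span" is relative: it searches only the descendants of first_day
relative <- xml_text(xml_find_all(first_day, ".//span"))

print(absolute)  # "Day 1" "Day 2"
print(relative)  # "Day 1"
```

The absolute form returns the spans of both days even though it is called on the first day's node, which is the behaviour described in the question.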
