Scraping dataTable with rvest by id, doesn't find table

Question

I'm trying to scrape data from the data table here, selecting the table by its id with an XPath expression:

library(rvest)
library(dplyr)

url <- "https://www.topuniversities.com/university-rankings/world-university-rankings/2018"  

h <- url %>% read_html() 

h %>% html_nodes(xpath = "//*[@id='qs-rankings-indicators']") %>% html_table()

The last command gives me this error:

Error in matrix(NA_character_, nrow = n, ncol = maxp) : 
  invalid 'ncol' value (too large or NA)
In addition: Warning messages:
1: In max(p) : no non-missing arguments to max; returning -Inf
2: In matrix(NA_character_, nrow = n, ncol = maxp) :
  NAs introduced by coercion to integer range

What am I missing here?

Answer 1

You actually already have the table node:

library(rvest)
library(dplyr)

url <- "https://www.topuniversities.com/university-rankings/world-university-rankings/2018"  

h <- url %>% read_html() 

h %>% 
  html_nodes(xpath = "//*[@id='qs-rankings-indicators']")

{xml_nodeset (1)}
[1] <table id="qs-rankings-indicators" class="order-column" cellspacing="0" width="100%"></table>

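That is, the same pipeline without the final %>% html_table().

The table contains no data because its rows are loaded by JavaScript after the initial HTML page load. read_html() only sees the initial HTML, so html_table() receives an empty <table> element and fails.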
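A quick way to confirm this diagnosis is to count the rows inside the selected node before calling html_table(). The sketch below is only an illustration: it parses an inline HTML string that mimics the empty table shown above instead of fetching the live page, and the variable names are made up for the example.

```r
library(rvest)

# Inline stand-in for the static HTML returned by the server:
# the <table> element exists, but it has no rows yet.
page <- read_html(
  '<html><body><table id="qs-rankings-indicators" class="order-column"></table></body></html>'
)

tbl <- html_nodes(page, xpath = "//*[@id='qs-rankings-indicators']")

# Count <tr> elements inside the selected table node
n_rows <- length(html_nodes(tbl, "tr"))

if (n_rows == 0) {
  message("Table node found, but it has no rows; the data is probably filled in by JavaScript.")
} else {
  print(html_table(tbl))
}
```

If the row count is zero for the real page as well, the problem lies in how the data is loaded, not in the XPath.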

To get the table including the JavaScript-loaded content, you need a tool that can run the site's JavaScript (I would recommend RSelenium).
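A rough sketch of what that could look like follows. It rests on several assumptions: Firefox is installed, RSelenium::rsDriver() can start a local Selenium server on your machine, and a fixed pause is long enough for the table to render. scrape_rendered_table() is a hypothetical helper written for this example, not a function from RSelenium or rvest.

```r
# Hypothetical helper: render the page in a real browser, then hand the
# resulting HTML to rvest. Calls are namespaced, so RSelenium, xml2 and
# rvest must be installed but do not need to be attached.
scrape_rendered_table <- function(url, table_id, wait_seconds = 5) {
  # Assumption: Firefox is available and rsDriver() can launch Selenium
  driver <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  remote <- driver$client

  # Always close the browser and stop the server, even on error
  on.exit({
    remote$close()
    driver$server$stop()
  }, add = TRUE)

  remote$navigate(url)
  Sys.sleep(wait_seconds)  # crude wait for the JavaScript to fill the table

  # Parse the HTML as it looks after the scripts have run
  page_source <- remote$getPageSource()[[1]]
  page <- xml2::read_html(page_source)
  nodes <- rvest::html_nodes(page, xpath = sprintf("//*[@id='%s']", table_id))
  rvest::html_table(nodes)[[1]]
}

# Usage (needs a browser and network access, so it is left commented out):
# url <- "https://www.topuniversities.com/university-rankings/world-university-rankings/2018"
# rankings <- scrape_rendered_table(url, "qs-rankings-indicators")
```

If the table is paginated in the browser, this approach will likely return only the rows currently displayed, which is one reason to prefer fetching the underlying data directly when that is possible.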

Answer 2

The table is rendered by JavaScript. It may be easier to get the JSON data directly from the source. Try something like this:

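# Current time in milliseconds, appended to the URL as a cache-busting parameter
tstamp <- function() as.character(trunc(as.numeric(Sys.time()) * 1e3))
# Data file that the ranking table is populated from
url <- "https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt"

# Parse the JSON and keep only the columns of interest from its `data` element
res <-
  jsonlite::fromJSON(paste0(url, "?_=", tstamp()))$data[, c(
    "rank_display", "score", "title", "country", "region"
  )]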

Output

> head(res)
  rank_display score                                        title        country        region
1            1   100  Massachusetts Institute of Technology (MIT)  United States North America
2            2  98.7                          Stanford University  United States North America
3            3  98.4                           Harvard University  United States North America
4            4  97.7 California Institute of Technology (Caltech)  United States North America
5            5  95.6                      University of Cambridge United Kingdom        Europe
6            6  95.3                         University of Oxford United Kingdom        Europe
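To see why indexing `$data` by column name works, here is a small offline sketch. It assumes the file is a JSON object whose `data` element is an array of records, which is what the code above implies. The two records are copied from the output above; storing every value as a string is an assumption about the real file, and sample_json is a name made up for this example.

```r
library(jsonlite)

# Assumed shape of the data file: {"data": [ {record}, {record}, ... ]}
sample_json <- '{
  "data": [
    {"rank_display": "1", "score": "100",
     "title": "Massachusetts Institute of Technology (MIT)",
     "country": "United States", "region": "North America"},
    {"rank_display": "2", "score": "98.7",
     "title": "Stanford University",
     "country": "United States", "region": "North America"}
  ]
}'

# fromJSON() simplifies an array of records into a data frame,
# so `$data` can be subset by column name like any other data frame
res <- fromJSON(sample_json)$data[, c("rank_display", "score", "title")]

# If the scores arrive as strings, convert them before doing arithmetic
res$score <- as.numeric(res$score)
str(res)
```

With the real file, check str(res) first to see which columns actually need converting.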

r

web-scraping

rvest