Scraping a dataTable with rvest by id doesn't find the table
I am trying to scrape the data from the data table here, selecting the table by its id with an xpath expression:
library(rvest)
library(dplyr)
url <- "https://www.topuniversities.com/university-rankings/world-university-rankings/2018"
h <- url %>% read_html()
h %>% html_nodes(xpath = "//*[@id='qs-rankings-indicators']") %>% html_table()
The last command gives me this error:
Error in matrix(NA_character_, nrow = n, ncol = maxp) :
invalid 'ncol' value (too large or NA)
In addition: Warning messages:
1: In max(p) : no non-missing arguments to max; returning -Inf
2: In matrix(NA_character_, nrow = n, ncol = maxp) :
NAs introduced by coercion to integer range
What am I missing here?
You have actually already got the table:
library(rvest)
library(dplyr)
url <- "https://www.topuniversities.com/university-rankings/world-university-rankings/2018"
h <- url %>% read_html()
h %>%
html_nodes(xpath = "//*[@id='qs-rankings-indicators']")
{xml_nodeset (1)}
[1] <table id="qs-rankings-indicators" class="order-column" cellspacing="0" width="100%"></table>
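That is, the same pipeline without the final %>% html_table().
The table contains no data because its rows are filled in by JavaScript after the initial HTML page has loaded, and read_html() only sees that initial HTML.
To get the table including the JavaScript-loaded content, you need a tool that can run the site's JavaScript (I would recommend RSelenium).
The table is rendered by JavaScript. You could instead fetch the JSON data directly from its source. Try something like this: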
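A quick way to see this for yourself, and to avoid the matrix() error, is to check whether the table has any rows before calling html_table(). The following is only a minimal sketch: the HTML string is a stand-in for what read_html() gets back from the page, and safe_html_table() is a hypothetical helper name, not part of rvest.

```r
library(rvest)

# Stand-in for the static HTML returned by read_html(url): the <table> element
# exists, but it has no rows because the page fills it in later via JavaScript.
static_html <- paste0(
  "<html><body>",
  "<table id=\"qs-rankings-indicators\" class=\"order-column\"></table>",
  "</body></html>"
)
page <- read_html(static_html)

# Hypothetical helper: return the parsed table, or NULL when the element is
# missing or contains no <tr> rows.
safe_html_table <- function(page, css) {
  node <- html_node(page, css)
  if (inherits(node, "xml_missing")) return(NULL)        # no such element
  if (length(html_nodes(node, "tr")) == 0) return(NULL)  # element is empty
  html_table(node)
}

tbl <- safe_html_table(page, "#qs-rankings-indicators")
if (is.null(tbl)) message("Table is empty in the static HTML; its rows are loaded later.")
```

If the helper returns NULL for the real page as well, that confirms the data is not in the HTML you downloaded and has to come from somewhere else.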
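# Current time in milliseconds, appended to the URL as a cache-busting parameter
tstamp <- function() as.character(trunc(as.numeric(Sys.time()) * 1e3))
# JSON source behind the table
url <- "https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt"
# Parse the JSON and keep only the columns of interest from its "data" element
res <-
  jsonlite::fromJSON(paste0(url, "?_=", tstamp()))$data[, c(
    "rank_display", "score", "title", "country", "region"
  )]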
Output
> head(res)
rank_display score title country region
1 1 100 Massachusetts Institute of Technology (MIT) United States North America
2 2 98.7 Stanford University United States North America
3 3 98.4 Harvard University United States North America
4 4 97.7 California Institute of Technology (Caltech) United States North America
5 5 95.6 University of Cambridge United Kingdom Europe
6 6 95.3 University of Oxford United Kingdom Europe
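If you want to try the parsing step without hitting the site, here is a small offline sketch. The JSON string is a made-up payload that only mimics the shape assumed above (a top-level "data" array of records); the "extra_field" key and the assumption that scores arrive as strings are mine, not something confirmed by the endpoint.

```r
library(jsonlite)

# Hypothetical payload: same shape as assumed for the real file, two records only.
sample_json <- '{"data": [
  {"rank_display": "1", "score": "100",
   "title": "Massachusetts Institute of Technology (MIT)",
   "country": "United States", "region": "North America", "extra_field": "x"},
  {"rank_display": "2", "score": "98.7",
   "title": "Stanford University",
   "country": "United States", "region": "North America", "extra_field": "y"}
]}'

# fromJSON() simplifies an array of records into a data frame.
parsed <- fromJSON(sample_json)

wanted <- c("rank_display", "score", "title", "country", "region")

# Fail with a clear message if the payload layout ever changes.
missing_cols <- setdiff(wanted, names(parsed$data))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}

res <- parsed$data[, wanted]

# Assumption: scores are strings in the payload, so convert them for sorting.
res$score <- as.numeric(res$score)
```

Checking for the expected columns first is a cheap safeguard, since a file like this is an internal detail of the site and its layout may change without notice.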