R rvest: extracting HTML tables that are loaded dynamically

I am trying to extract an HTML table and convert it to a data.frame or data.table in R.

I want to extract the table containing the Bitcoin historical data at:
https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20170101&end=20201113

(Full XPath: /html/body/div/div[1]/div[2]/div[1]/div[2]/div[3]/div/ul[2]/li[5]/div/div/div[2]/div[3]/div/table)

This is what I have tried so far:

library(magrittr)
library(rvest)
URL <- "https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20170101&end=20201113"
PRICES <- read_html(URL) %>% html_nodes("table")

However, the historical price table does not appear in the output.

My guess is that the table is loaded after the rest of the page.
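That guess can be checked offline with a small sketch. The page source below is made up for illustration (the `prices` id and the placeholder script are assumptions, not CoinMarketCap's actual markup); it mimics a page whose table body is only filled in by JavaScript. Since `read_html()` parses the HTML as delivered and does not run scripts, no rows are found:

```r
library(rvest)

# Made-up page source: the table exists, but its rows would only be
# inserted by JavaScript after the page has loaded.
static_html <- '<html><body>
  <table id="prices"><tbody></tbody></table>
  <script>/* an XHR call would insert the rows here */</script>
</body></html>'

page <- read_html(static_html)

# read_html() does not execute the script, so there are no <tr> nodes
rows <- page %>% html_nodes("table#prices tr")
n_rows <- length(rows)
```

Here `n_rows` is 0, which matches what a plain `read_html(URL)` call sees on a page built this way.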

Ideally, the extraction method should also work for the historical tables of other cryptocurrencies, for example:
https://coinmarketcap.com/currencies/ethereum/historical-data/?start=20170101&end=20201113

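You're right: the table is loaded dynamically by an XHR call after the page loads, so rvest alone cannot see it. Probably the best solution is to find the address of the API that supplies the table's data, which you can do with the developer tools in your browser. You then need to parse the JSON, which can be tricky. For example, in your case we can do this: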

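# Endpoint found via the browser's developer tools (Network tab);
# id=1027 appears to be CoinMarketCap's ID for Ethereum
url <- paste0("https://web-api.coinmarketcap.com/v1/cryptocurrency/",
              "price-performance-stats/latest?id=1027&include_volume=true&",
              "time_period=all_time,24h,7d,30d,90d,365d,yesterday")

# Parse the JSON response and keep the per-period statistics
res <- httr::content(httr::GET(url), "parsed")$data$`1027`$periods
# Flatten the first nine USD quote fields of each period into one row
df <- do.call(rbind, lapply(res, function(x) unlist(x$quote$USD[1:9])))
df <- as.data.frame(df, stringsAsFactors = FALSE)
# Odd columns hold prices and percentages, even columns hold timestamps
for(i in c(1, 3, 5, 7, 9)) df[[i]] <- as.numeric(df[[i]])
for(i in c(2, 4, 6, 8)) df[[i]] <- strptime(df[[i]], "%Y-%m-%dT%H:%M:%S")
# Sort by opening timestamp, most recent first
df[rev(order(df$open_timestamp)),]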

#                open      open_timestamp      high      high_timestamp        low
# 24h       458.44767 2020-11-12 15:30:29  470.5202 2020-11-13 15:03:02 452.072417
# yesterday 462.95952 2020-11-12 00:00:00  467.6778 2020-11-12 10:08:13 452.072417
# 7d        440.22446 2020-11-06 15:30:29  473.5789 2020-11-11 18:50:25 428.456353
# 30d       380.66672 2020-10-14 15:30:29  473.5789 2020-11-11 18:50:03 362.597418
# 90d       434.50894 2020-08-15 15:30:29  487.2119 2020-09-01 22:17:01 316.774346
# 365d      185.43604 2019-11-14 15:30:29  487.2119 2020-09-01 00:00:00  95.184301
# all_time    2.83162 2015-08-07 00:00:00 1432.8800 2018-01-13 00:00:00   0.420897
#                 low_timestamp    close     close_timestamp percent_change
# 24h       2020-11-12 18:27:13 467.0052 2020-11-13 15:30:29      1.8666241
# yesterday 2020-11-12 18:27:13 461.0053 2020-11-12 23:59:59     -0.4221218
# 7d        2020-11-07 20:11:13 467.0052 2020-11-13 15:30:29      6.0834207
# 30d       2020-10-16 09:21:41 467.0052 2020-11-13 15:30:29     22.6808486
# 90d       2020-09-05 18:55:23 467.0052 2020-11-13 15:30:29      7.4788382
# 365d      2020-03-13 00:00:00 467.0052 2020-11-13 15:30:29    151.8416350
# all_time  2015-10-21 00:00:00 467.0052 2020-11-13 15:30:29  16392.5082715
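Because parsing the JSON is the fiddly part, here is a minimal offline sketch of the same flattening step. The payload is a heavily trimmed, made-up stand-in: the nesting follows the code above, but the values are illustrative, and the use of jsonlite instead of httr is an assumption made so the example runs without a network call.

```r
library(jsonlite)

# Made-up, trimmed payload; only the nesting mirrors the real response
json <- '{
  "data": {"1027": {"periods": {
    "24h": {"quote": {"USD": {"open": 458.4, "close": 467.0}}},
    "7d":  {"quote": {"USD": {"open": 440.2, "close": 467.0}}}
  }}}
}'

# Keep the nested list structure rather than letting fromJSON simplify it
periods <- fromJSON(json, simplifyVector = FALSE)$data$`1027`$periods

# One single-row data frame per period, then stack them;
# the period names become the row names
rows <- lapply(periods, function(p) as.data.frame(p$quote$USD))
df_small <- do.call(rbind, rows)
```

Building one small data frame per period keeps the numeric columns numeric, so no per-column conversion is needed afterwards. Timestamp fields, if present, would still arrive as character strings and need parsing.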

Following @AllanCameron's suggestion, we can use RSelenium together with rvest to extract the table. Here is a script that worked for me:

library(RSelenium)
library(rvest)
library(magrittr)

URL <- "https://coinmarketcap.com/currencies/ethereum/historical-data/?start=20170101&end=20201113"

# Open firefox and extract source
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate(URL)
html <- remDr$getPageSource()[[1]]

# Extract table from source
DF <- read_html(html) %>% 
  html_nodes("table") %>% 
  `[[`(3) %>% 
  html_table %>% data.frame

# Close the browser session and stop the Selenium server
remDr$close()
rD[["server"]]$stop()
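To reuse this for other currencies, and to avoid depending on the table being the third one on the page, something like the following sketch might help. Both helpers are hypothetical, and the column names used to recognise the right table (`Date` and `Volume`) are assumptions about the page's header row. The HTML string is made up and stands in for `remDr$getPageSource()[[1]]`.

```r
library(rvest)

# Hypothetical helper: build the historical-data URL for a currency slug
historical_url <- function(slug, start, end) {
  sprintf("https://coinmarketcap.com/currencies/%s/historical-data/?start=%s&end=%s",
          slug, start, end)
}

# Hypothetical helper: return the first table whose header contains
# all of the wanted column names (assumed names, adjust as needed)
pick_table <- function(html, wanted = c("Date", "Volume")) {
  tables <- read_html(html) %>% html_nodes("table") %>% html_table()
  has_cols <- vapply(tables, function(tb) all(wanted %in% names(tb)), logical(1))
  if (!any(has_cols)) stop("no table with the expected columns")
  as.data.frame(tables[[which(has_cols)[1]]])
}

# Made-up page source with two tables; only the second has the wanted columns
html <- "<html><body>
  <table><tr><th>Rank</th></tr><tr><td>1</td></tr></table>
  <table><tr><th>Date</th><th>Volume</th></tr>
         <tr><td>2020-11-12</td><td>12345</td></tr></table>
</body></html>"

eth_url <- historical_url("ethereum", "20170101", "20201113")
DF <- pick_table(html)
```

In the script above, `remDr$navigate(eth_url)` followed by `pick_table(remDr$getPageSource()[[1]])` would take the place of the `[[`(3) step, provided the page has finished rendering before the source is read.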