R rvest: extracting HTML tables that are loaded dynamically
I am trying to extract an HTML table and convert it into a data.frame or data.table in R.
I want to extract the table containing the Bitcoin historical data:
https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20170101&end=20201113
(full XPath: /html/body/div/div[1]/div[2]/div[1]/div[2]/div[3]/div/ul[2]/li[5]/div/div/div[2]/div[3]/div/table)
This is what I have tried so far:
library(magrittr)
library(rvest)
URL <- "https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20170101&end=20201113"
PRICES <- read_html(URL) %>% html_nodes("table")
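However, the historical price table does not appear in the output.
My guess is that the table is loaded after the rest of the page.
Ideally, the extraction method should also work for the historical tables of other cryptocurrencies, for example: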
https://coinmarketcap.com/currencies/ethereum/historical-data/?start=20170101&end=20201113
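You are right: this table is loaded dynamically by an XHR call after the page has loaded, so rvest alone cannot retrieve it. Perhaps the best solution is to find the address of the API that generates the table, which you can do with your browser's developer tools. You then need to parse the JSON, which can be tricky. For example, in your case we could do this: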
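# Endpoint found with the browser's developer tools
url <- paste0("https://web-api.coinmarketcap.com/v1/cryptocurrency/",
              "price-performance-stats/latest?id=1027&include_volume=true&",
              "time_period=all_time,24h,7d,30d,90d,365d,yesterday")
# Parse the JSON response and keep the per-period entries for id 1027
res <- httr::content(httr::GET(url), "parsed")$data$`1027`$periods
# One row per period, built from the first nine USD quote fields
df <- do.call(rbind, lapply(res, function(x) unlist(x$quote$USD[1:9])))
df <- as.data.frame(df, stringsAsFactors = FALSE)
# Odd columns are numeric, even columns are ISO 8601 timestamps
for(i in c(1, 3, 5, 7, 9)) df[[i]] <- as.numeric(df[[i]])
for(i in c(2, 4, 6, 8)) df[[i]] <- strptime(df[[i]], "%Y-%m-%dT%H:%M:%S")
# Most recent opening time first
df[rev(order(df$open_timestamp)),]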
#                open      open_timestamp      high      high_timestamp        low
# 24h       458.44767 2020-11-12 15:30:29  470.5202 2020-11-13 15:03:02 452.072417
# yesterday 462.95952 2020-11-12 00:00:00  467.6778 2020-11-12 10:08:13 452.072417
# 7d        440.22446 2020-11-06 15:30:29  473.5789 2020-11-11 18:50:25 428.456353
# 30d       380.66672 2020-10-14 15:30:29  473.5789 2020-11-11 18:50:03 362.597418
# 90d       434.50894 2020-08-15 15:30:29  487.2119 2020-09-01 22:17:01 316.774346
# 365d      185.43604 2019-11-14 15:30:29  487.2119 2020-09-01 00:00:00  95.184301
# all_time    2.83162 2015-08-07 00:00:00 1432.8800 2018-01-13 00:00:00   0.420897
#                 low_timestamp    close     close_timestamp percent_change
# 24h       2020-11-12 18:27:13 467.0052 2020-11-13 15:30:29      1.8666241
# yesterday 2020-11-12 18:27:13 461.0053 2020-11-12 23:59:59     -0.4221218
# 7d        2020-11-07 20:11:13 467.0052 2020-11-13 15:30:29      6.0834207
# 30d       2020-10-16 09:21:41 467.0052 2020-11-13 15:30:29     22.6808486
# 90d       2020-09-05 18:55:23 467.0052 2020-11-13 15:30:29      7.4788382
# 365d      2020-03-13 00:00:00 467.0052 2020-11-13 15:30:29    151.8416350
# all_time  2015-10-21 00:00:00 467.0052 2020-11-13 15:30:29  16392.5082715
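Since the JSON reshaping is the tricky part, here is a minimal sketch of the same idea that runs offline. The list below is hand-written and only imitates the nesting of the parsed response (period, then quote, then USD); its field names and values are made up for illustration and are not real API output. It also converts columns by name instead of by position, which may hold up better if the field order ever changes.

```r
# Hypothetical stand-in for the parsed JSON: one entry per period,
# each holding quote$USD with a mix of numbers and ISO timestamps.
res <- list(
  `24h` = list(quote = list(USD = list(
    open = 100, open_timestamp = "2020-11-12T15:30:29.000Z",
    close = 110, close_timestamp = "2020-11-13T15:30:29.000Z"))),
  `7d` = list(quote = list(USD = list(
    open = 90, open_timestamp = "2020-11-06T15:30:29.000Z",
    close = 110, close_timestamp = "2020-11-13T15:30:29.000Z")))
)

# One row per period; unlist() coerces everything to character first
rows <- lapply(res, function(x) unlist(x$quote$USD))
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)

# Convert columns by name rather than by position
is_time <- grepl("_timestamp$", names(df))
df[!is_time] <- lapply(df[!is_time], as.numeric)
df[is_time] <- lapply(df[is_time], as.POSIXct,
                      format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

# Most recent opening time first
df <- df[order(df$open_timestamp, decreasing = TRUE), ]
df
```

Parsing the timestamps as POSIXct in UTC (rather than with strptime in the local time zone) is a design choice of this sketch, not something the API requires.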
Following @AllanCameron's suggestion, we can use RSelenium together with rvest to extract the table. Here is a script that worked for me:
library(RSelenium)
library(rvest)
library(magrittr)

URL <- "https://coinmarketcap.com/currencies/ethereum/historical-data/?start=20170101&end=20201113"

# Open Firefox and extract the rendered page source
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate(URL)
html <- remDr$getPageSource()[[1]]

# Extract the third table from the source
DF <- read_html(html) %>%
  html_nodes("table") %>%
  `[[`(3) %>%
  html_table() %>%
  data.frame()

# Close the browser session and stop the Selenium server
remDr$close()
rD[["server"]]$stop()
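Two small additions may help when reusing this for other coins. The sketch below is an assumption-laden illustration, not part of the script that was tested: `historical_url()` and `wait_until()` are hypothetical helper names, the URL pattern is simply copied from the links in the question, and the "at least three tables" check mirrors the third-table selection in the script. Because the table arrives after the initial page load, polling before reading the page source can avoid grabbing the HTML too early.

```r
# Build the historical-data URL for a coin slug such as "bitcoin" or "ethereum"
# (pattern taken from the URLs in the question; hypothetical helper)
historical_url <- function(slug, start, end) {
  sprintf("https://coinmarketcap.com/currencies/%s/historical-data/?start=%s&end=%s",
          slug, start, end)
}

# Poll check() until it returns TRUE or `timeout` seconds have passed
# (hypothetical helper; returns TRUE on success, FALSE on timeout)
wait_until <- function(check, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(check())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Intended use with an open RSelenium session (not run here):
# remDr$navigate(historical_url("bitcoin", "20170101", "20201113"))
# loaded <- wait_until(function() {
#   length(remDr$findElements(using = "css selector", value = "table")) >= 3
# })
# if (loaded) html <- remDr$getPageSource()[[1]]
```

The timeout and polling interval are arbitrary defaults; a slow connection may need a longer timeout.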