Finding all csv links from website using R
I am trying to download data files containing margin strategy information from the ICE website (https://www.theice.com/clear-us/risk-management#margin-rates). I attempted to do this with the following R code:
library(rvest)
library(stringr)

page <- read_html("https://www.theice.com/clear-us/risk-management#margin-rates")

raw_list <- page %>%        # the page above, whose html we've read
  html_nodes("a") %>%       # find all links in the page
  html_attr("href") %>%     # get the url for these links
  str_subset("\\.csv")      # keep only those containing ".csv" (backslash must be doubled in R strings)
However, it only finds two csv files. That is, it does not detect any of the files that appear when I click Margin Rates and go to Historic ICE Risk Model Parameters. See below:
raw_list
[1] "/publicdocs/iosco_reporting/haircut_history/icus/ICUS_Asset_Haircuts_History.csv"
[2] "/publicdocs/iosco_reporting/haircut_history/icus/ICUS_Currency_Haircuts_History.csv"
I would like to know how I can do this so that later I can select the files and download them.
Many thanks
It seems the page does not load that part of its content immediately, and it is not in your request either. The network monitor indicates that the file "ClearUSRiskArrayFiles.shtml" is loaded after about 400 ms. That file appears to provide the desired links once you specify a year and month in the URL.
library(rvest)
library(stringr)
page <- read_html("https://www.theice.com/iceriskmodel/ClearUSRiskArrayFiles.shtml?getRiskArrayTable=&type=icus&year=2021&month=03")
raw_list <- page %>% # takes the page above for which we've read the html
html_nodes("a") %>% # find all links in the page
html_attr("href")
head(raw_list[grepl("csv", raw_list)], 3L)
#> [1] "/publicdocs/irm_files/icus/2021/03/NYB0312E.csv.zip"
#> [2] "/publicdocs/irm_files/icus/2021/03/NYB0311E.csv.zip"
#> [3] "/publicdocs/irm_files/icus/2021/03/NYB0311F.csv.zip"
Created on 2021-03-12 by the reprex package (v1.0.0)
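To actually fetch the matched files, a minimal follow-up sketch; it assumes the relative paths above resolve against `https://www.theice.com` (not verified here), and only grabs the first three as a demonstration:

```r
# Download the first few csv.zip files found above.
# Assumption: relative hrefs resolve against the site root.
base_url <- "https://www.theice.com"
csv_zips <- raw_list[grepl("csv", raw_list)]

for (path in head(csv_zips, 3L)) {
  download.file(paste0(base_url, path),
                destfile = basename(path),
                mode = "wb")  # binary mode so the zip archives are not corrupted
}
```

You can then unzip and read each file with `unzip()` and `read.csv()` as needed.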
We can look at the network traffic in the browser dev tools to find the url behind each dropdown action.
The Historic ICE Risk Model Parameter dropdown pulls from this page:
https://www.theice.com/marginrates/ClearUSMarginParameterFiles.shtml;jsessionid=7945F3FE58331C88218978363BA8963C?getParameterFileTable&category=Historical
We remove the jsessionid (per QHarr's comment) and use that as our endpoint:
endpoint <- "https://www.theice.com/marginrates/ClearUSMarginParameterFiles.shtml?getParameterFileTable&category=Historical"
page <- read_html(endpoint)
Then we can get the full list of csvs:
raw_list <- page %>%
html_nodes(".table-partitioned a") %>% # add specificity as QHarr suggests
html_attr("href")
Output:
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERMONTH_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERCONTRACT_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_SCANNING_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_STRATEGY_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERMONTH_20210226.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERCONTRACT_20210226.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_SCANNING_20210226.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_STRATEGY_20210226.CSV'
...
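Since the question asks specifically about the margin strategy files, a hedged sketch of selecting and reading one; it assumes these relative paths resolve against `https://www.theice.com` and that the `STRATEGY` files are the ones of interest:

```r
# Filter the link list for margin strategy files and read the most recent one.
# Assumption: relative hrefs resolve against the site root.
base_url <- "https://www.theice.com"
strategy_files <- raw_list[grepl("STRATEGY", raw_list)]

margin_strategy <- read.csv(paste0(base_url, strategy_files[1]))
```

The date embedded in each filename (e.g. `20210310`) can be parsed with `str_extract()` if you need to pick files by date rather than by list position.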