Reading off links on a site and storing them in a list
I'm trying to read off the data URLs from StatsCan as follows:
library(rvest)  # read_html(), html_nodes(), html_attr(); also re-exports %>%

# 2015
url <- "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/crude-oil-prices-2015/18122"
x1 <- read_html(url) %>%
  html_nodes(xpath = '//*[@class="col-md-4"]/ul/li/ul/li/a') %>%
  html_attr("href")

# 2014
url2 <- "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/crude-oil-prices-2014/16993"
x2 <- read_html(url2) %>%  # was read_html(url), which re-scraped the 2015 page
  html_nodes(xpath = '//*[@class="col-md-4"]/ul/li/ul/li/a') %>%
  html_attr("href")
This returns two empty lists; I'm confused, because the same approach works for this link: https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/18087. Ultimately I want to loop through the list and read off the table on each page:
library(stringr)   # str_c()
library(tibble)    # as_tibble()
library(openxlsx)  # write.xlsx(); the xlsx package provides the same function

for (i in seq_along(x2)) {
  out.data <- read_html(x2[i]) %>%
    html_table(fill = TRUE) %>%  # returns a list of all tables on the page
    `[[`(1) %>%                  # keep only the first table
    as_tibble()
  # destination is an output folder path defined elsewhere
  write.xlsx(out.data, str_c(destination, i, ".xlsx"))
}
To extract all the URLs, I would suggest using the CSS selector ".field-item li a" and subsetting by pattern:
links <- read_html(url) %>%
  html_nodes(".field-item li a") %>%
  html_attr("href") %>%
  str_subset("fuel-prices/crude")
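To feed these back into the question's loop, note that the hrefs scraped this way are likely site-relative, so they need resolving to absolute URLs before read_html() can fetch them. A minimal sketch, assuming the NRCan domain as the base and reusing links from above:

# Assumption: the scraped hrefs look like "/our-natural-resources/...",
# so resolve them against the site root with xml2::url_absolute()
full_links <- xml2::url_absolute(links, "https://www.nrcan.gc.ca")

# Read the first table from the first resolved page as a quick check
first_table <- read_html(full_links[1]) %>%
  html_table(fill = TRUE) %>%
  `[[`(1)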
Your XPath needs fixing. You could use this one:
//strong[contains(.,"Oil")]/following-sibling::ul//a
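Plugged into rvest, that might look like the sketch below; the XPath selects every anchor inside the <ul> lists that follow a <strong> element whose text contains "Oil":

library(rvest)

url <- "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/crude-oil-prices-2015/18122"

x1 <- read_html(url) %>%
  html_nodes(xpath = '//strong[contains(.,"Oil")]/following-sibling::ul//a') %>%
  html_attr("href")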