R code for downloading all the pdfs given on a site: Web scraping
I want to write R code that downloads all the PDFs given at this URL: https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook%20of%20Statistics%20on%20Indian%20Economy and saves them into a single folder. I tried the following code, adapted from a https://towardsdatascience.com tutorial, but it errors out.
library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
page <- read_html("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook%20of%20Statistics%20on%20Indian%20Economy") %>%
raw_list <- page %>% # takes the page above for which we've read the html
  html_nodes("a") %>% # find all links in the page
  html_attr("href") %>% # get the url for these links
  str_subset("\\.pdf") %>% # find those that end in pdf only
  str_c("https://rbi.org.in", .) %>% # prepend the website to the url
  map(read_html) %>% # take previously generated list of urls and read them
  map(html_node, "#raw-url") %>% # parse out the 'raw' url - the link for the download button
  map(html_attr, "href") %>% # return the set of raw urls for the download buttons
  str_c("https://www.rbi.org.in", .) %>% # prepend the website again to get a full url
for (url in raw_list) {
  download.file(url, destfile = basename(url), mode = "wb")
}
I can't work out why the code errors out. I'd appreciate it if someone could help.
There are small errors: the site uses capital letters for the PDF extension, and you don't need str_c("https://rbi.org.in", .). Finally, I think using purrr's walk2 is smoother (as it probably was in the original code). I haven't executed the code myself, since I don't need that many PDFs, so please report back whether it works.
library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
page <- read_html("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook%20of%20Statistics%20on%20Indian%20Economy")
raw_list <- page %>% # takes the page above for which we've read the html
  html_nodes("a") %>% # find all links in the page
  html_attr("href") %>% # get the url for these links
  str_subset("\\.PDF") %>% # this site uses an upper-case .PDF extension
  walk2(., basename(.), download.file, mode = "wb") # download each url to its basename
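If you also want the files collected in a dedicated folder, as the question asks, a minimal variation on the same pipeline (untested; the rbi_pdfs folder name is just an example) would be:

dir.create("rbi_pdfs", showWarnings = FALSE) # create the target folder if it doesn't exist
page %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  str_subset("\\.PDF") %>%
  walk2(., file.path("rbi_pdfs", basename(.)), download.file, mode = "wb") # save each file into rbi_pdfs/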
While trying to run your code, I ran into "please verify you are a human" and "please make sure Javascript is enabled in your browser" dialogs. This indicates that you cannot open the page with rvest alone; you need to use RSelenium browser automation instead.
Here is a modified version using RSelenium:
library(tidyverse)
library(stringr)
library(purrr)
library(rvest)
library(RSelenium)
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]
remDr$navigate("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook%20of%20Statistics%20on%20Indian%20Economy")
page <- remDr$getPageSource()[[1]]
read_html(page) -> html
html %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  str_subset("\\.PDF") -> urls

urls %>% str_split(., '/') %>% unlist() %>% str_subset("\\.PDF") -> filenames
for (u in seq_along(urls)) {
  cat(paste('downloading: ', u, ' of ', length(urls), '\n')) # print progress
  download.file(urls[u], filenames[u], mode = 'wb')
  Sys.sleep(1) # pause briefly between requests to be polite to the server
}
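When the loop finishes, it is worth closing the browser session and stopping the Selenium server so the port is released:

remDr$close() # close the browser window
rD$server$stop() # stop the Selenium server and free the port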