How to scrape fundamentals data of NSE indices (NIFTY 50) using R
I am trying to scrape the fundamentals data table (P/E, P/B and dividend yield) from the NSE website (link). I tried the following with the rvest package:
library(rvest)
url <- "https://www1.nseindia.com/products/content/equities/indices/historical_pepb.htm"
pgsession <- html_session(url)
However, I am getting this error:
Error in curl::curl_fetch_memory(url, handle = handle) :
LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 60
I also tried the httr package (the CSS selector was identified with the Chrome extension 'SelectorGadget'):
library(httr)
fd <- list(submit = "Get Data",  # not sure if this is the correct css selector
           IndexName = "NIFTY 50",
           fromDate = "01-06-2020",
           toDate = "15-06-2020")
resp <- POST(url, body = fd, encode = "form")
However, I get the same error. I have gone through many forums trying to resolve this, but the website seems to block scraping attempts. Can someone verify this or suggest a way to scrape the table from this website?
If you right-click the page, click 'Inspect element', and go to the 'Network' tab, you can see the request that is made when you click the 'Get data' button.
In this case the request is the URL below, which can easily be read and parsed into a data frame using, for example, rvest::html_table().
By changing the URL, I am sure you can extract the table you want.
url <- "https://www1.nseindia.com/products/dynaContent/equities/indices/historical_pepb.jsp?indexName=NIFTY%2050&fromDate=01-06-2020&toDate=02-06-2020&yield1=undefined&yield2=undefined&yield3=undefined&yield4=all"
rvest::html_table(xml2::read_html(url))[[1]]
which gives:
Historical NIFTY 50 P/E, P/B & Div. Yield values Historical NIFTY 50 P/E, P/B & Div. Yield values
1 For the period 01-06-2020 to 02-06-2020 For the period 01-06-2020 to 02-06-2020
2 Date P/E
3 01-Jun-2020 22.96
4 02-Jun-2020 23.31
5 Download file in csv format Download file in csv format
Historical NIFTY 50 P/E, P/B & Div. Yield values Historical NIFTY 50 P/E, P/B & Div. Yield values
1 For the period 01-06-2020 to 02-06-2020 For the period 01-06-2020 to 02-06-2020
2 P/B Div Yield
3 2.80 1.55
4 2.84 1.53
5 Download file in csv format Download file in csv format
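If you prefer to stay with httr (which you already tried), the same JSP endpoint can be queried by passing the parameters as a named list, letting httr do the URL encoding for you. This is only a minimal sketch assuming the parameter names visible in the URL above; I have omitted the yield1–yield3 parameters on the assumption that they are optional:

library(httr)
library(rvest)

# Same JSP endpoint; the query string is built from a named list
resp <- GET(
  "https://www1.nseindia.com/products/dynaContent/equities/indices/historical_pepb.jsp",
  query = list(
    indexName = "NIFTY 50",   # httr URL-encodes the space for you
    fromDate  = "01-06-2020",
    toDate    = "02-06-2020",
    yield4    = "all"
  )
)
tbl <- html_table(read_html(content(resp, as = "text", encoding = "UTF-8")))[[1]]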
Here is a (crude) wrapper to fetch the NIFTY 50 fundamentals data from the NSE website:
get.nse.ratios <- function(index.nse = 'NIFTY 50', date.start = as.Date('2001-01-01'), date.end = Sys.Date()) {
  # Source page: https://www1.nseindia.com/products/content/equities/indices/historical_pepb.htm
  index.nse <- gsub(' ', '%20', index.nse)
  # The site only serves a limited window per request, so split the date range
  max.history.constraint <- 100
  dates.start <- seq.Date(date.start, date.end, by = max.history.constraint)
  data.master <- data.frame()
  # Loop over sub-periods to extract data (as.list() keeps the Date class inside the loop)
  for (fromDate in as.list(dates.start)) {
    toDate <- min(fromDate + (max.history.constraint - 1), date.end)
    cat(sprintf('Fetching data from %s to %s\n', fromDate, toDate))
    # Reformat dates to dd-mm-YYYY, the format the JSP endpoint expects
    fromDate <- format(fromDate, '%d-%m-%Y')
    toDate <- format(toDate, '%d-%m-%Y')
    # Build the URL for the sub-period
    url.sub <- sprintf("https://www1.nseindia.com/products/dynaContent/equities/indices/historical_pepb.jsp?indexName=%s&fromDate=%s&toDate=%s&yield1=undefined&yield2=undefined&yield3=undefined&yield4=all", index.nse, fromDate, toDate)
    # Scrape the first table from the page
    data.sub <- rvest::html_table(xml2::read_html(url.sub))[[1]]
    # Clean the table: row 1 is the title, row 2 holds the column names,
    # and the last row is a download-link footer
    names.columns <- unname(unlist(data.sub[2, ]))
    data.clean <- data.sub[3:(nrow(data.sub) - 1), ]
    colnames(data.clean) <- names.columns
    data.clean$Date <- as.Date(data.clean$Date, format = '%d-%b-%Y')
    # Convert the remaining character columns (P/E, P/B, Div Yield) to numeric
    cols.num <- names(which(sapply(data.clean, class) == 'character'))
    data.clean[cols.num] <- sapply(data.clean[cols.num], as.numeric)
    # Append to master data
    data.master <- rbind(data.master, data.clean)
  }
  return(data.master)
}
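Calling it for a short window then looks like this (the date range is only illustrative):

# Fetch the first half of June 2020 (illustrative dates)
nifty <- get.nse.ratios('NIFTY 50',
                        date.start = as.Date('2020-06-01'),
                        date.end   = as.Date('2020-06-15'))
head(nifty)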