How to scrape fundamentals data of NSE indices (NIFTY 50) using R

I am trying to scrape the fundamentals data table (P/E, P/B and dividend yield) from the NSE website (link). I tried the following with the rvest package:

library(rvest)

url <- "https://www1.nseindia.com/products/content/equities/indices/historical_pepb.htm"
pgsession <- html_session(url)

However, I get this error:

Error in curl::curl_fetch_memory(url, handle = handle) :
LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 60

I also tried the httr package (CSS selectors identified with the Chrome extension 'SelectorGadget'):

library(httr)

fd <- list(submit = "Get Data", # Not sure if it's the correct CSS selector
           IndexName = "NIFTY 50",
           fromDate = "01-06-2020",
           toDate = "15-06-2020")

resp <- POST(url, body = fd, encode = "form")

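However, I get the same error. I have gone through many forums looking for a fix, but the website seems to block scraping attempts. Can anyone verify this, or suggest a way to scrape the table from this site?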

If you right-click the page, click 'Inspect element', and go to the 'Network' tab, you can see the request being made when you click the 'Get data' button.

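In this case, the request is the URL below, which can easily be read and parsed into a data frame using e.g. rvest::html_table(). By changing the query parameters in the URL (index name and date range), I'm sure you can extract the table you want.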

url <- "https://www1.nseindia.com/products/dynaContent/equities/indices/historical_pepb.jsp?indexName=NIFTY%2050&fromDate=01-06-2020&toDate=02-06-2020&yield1=undefined&yield2=undefined&yield3=undefined&yield4=all"

rvest::html_table(xml2::read_html(url))[[1]]

which gives:

  Historical NIFTY 50  P/E, P/B & Div. Yield values Historical NIFTY 50  P/E, P/B & Div. Yield values
1           For the period 01-06-2020 to 02-06-2020           For the period 01-06-2020 to 02-06-2020
2                                              Date                                               P/E
3                                       01-Jun-2020                                             22.96
4                                       02-Jun-2020                                             23.31
5                       Download file in csv format                       Download file in csv format
  Historical NIFTY 50  P/E, P/B & Div. Yield values Historical NIFTY 50  P/E, P/B & Div. Yield values
1           For the period 01-06-2020 to 02-06-2020           For the period 01-06-2020 to 02-06-2020
2                                               P/B                                         Div Yield
3                                              2.80                                              1.55
4                                              2.84                                              1.53
5                       Download file in csv format                       Download file in csv format
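Since the index name and date range are plain query parameters, the URL can be assembled programmatically. The following is a minimal sketch: `build.pepb.url` is a hypothetical helper name, not part of rvest or the NSE site, and it assumes the endpoint accepts other index names and dd-mm-yyyy date ranges in the same form as the request captured above.

```r
# Hypothetical helper: assemble the query URL for an index and a date range.
# Assumes the endpoint keeps the same parameter names as the captured request.
build.pepb.url <- function(index.name, from.date, to.date) {
  base <- "https://www1.nseindia.com/products/dynaContent/equities/indices/historical_pepb.jsp"
  query <- sprintf(
    "indexName=%s&fromDate=%s&toDate=%s&yield1=undefined&yield2=undefined&yield3=undefined&yield4=all",
    utils::URLencode(index.name, reserved = TRUE),  # "NIFTY 50" becomes "NIFTY%2050"
    format(as.Date(from.date), "%d-%m-%Y"),         # the site uses dd-mm-yyyy
    format(as.Date(to.date), "%d-%m-%Y")
  )
  paste0(base, "?", query)
}

url <- build.pepb.url("NIFTY 50", "2020-06-01", "2020-06-02")
```

For these inputs the result is the same URL string as the one used above, so it can be passed to `xml2::read_html()` in the same way.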

Here is a (crude) wrapper to fetch NIFTY 50 fundamentals data from the NSE website:

get.nse.ratios <- function(index.nse = 'NIFTY 50', date.start = as.Date('2001-01-01'), date.end = as.Date(Sys.time())){
  # url.base <- 'https://www1.nseindia.com/products/content/equities/indices/historical_pepb.htm'
  index.nse <- gsub(' ', '%20', index.nse)
  
  # Split Date range into acceptable range
  max.history.constraint <- 100
  dates.start <- seq.Date(date.start, date.end, by = max.history.constraint)
  data.master <- data.frame()
  # Loop over sub-periods to extract data
  for(i in seq_along(dates.start)){
    # Index into the vector: looping over Dates directly drops the Date class
    fromDate <- dates.start[i]
    # Cap the sub-period at the requested end date
    toDate <- min(fromDate + (max.history.constraint - 1), date.end)
    
    cat(sprintf('Fetching data from %s to %s \n', format(fromDate), format(toDate)))
    # Reformat dates to the dd-mm-yyyy form used in the URL
    fromDate <- format(fromDate, '%d-%m-%Y')
    toDate <- format(toDate, '%d-%m-%Y')
    
    # Infer url for sub-period
    url.sub <- sprintf("https://www1.nseindia.com/products/dynaContent/equities/indices/historical_pepb.jsp?indexName=%s&fromDate=%s&toDate=%s&yield1=undefined&yield2=undefined&yield3=undefined&yield4=all", index.nse, fromDate, toDate)
    
    # Scrape table from inferred url
    data.sub <- rvest::html_table(xml2::read_html(url.sub))[[1]]
    
    # Clean the table
    names.columns <- unname(unlist(data.sub[2,]))
    data.clean <- data.sub[3:(nrow(data.sub)-1),]
    colnames(data.clean) <- names.columns
    data.clean$Date <- as.Date(data.clean$Date, format = '%d-%b-%Y')
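    # Convert the remaining character columns (the ratios) to numeric;
    # lapply keeps one list element per column even for a single-row table
    cols.num <- names(which(sapply(data.clean, class) == 'character'))
    data.clean[cols.num] <- lapply(data.clean[cols.num], as.numeric)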
    
    # Append to master data
    data.master <- rbind(data.master, data.clean)

  }
  
  return(data.master)
}
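The date-splitting step of the wrapper can be tried on its own, without touching the network. The sketch below is an illustration under assumptions: `chunk.date.range` is a hypothetical helper name, and the 100-day window is taken from `max.history.constraint` in the wrapper rather than from any documented limit of the site.

```r
# Hypothetical helper: split [date.start, date.end] into consecutive windows
# of at most chunk.days days, mirroring the wrapper's sub-period logic.
chunk.date.range <- function(date.start, date.end, chunk.days = 100) {
  from <- seq(date.start, date.end, by = chunk.days)  # window start dates
  to <- pmin(from + (chunk.days - 1), date.end)       # cap the last window
  data.frame(from = from, to = to)
}

chunks <- chunk.date.range(as.Date("2020-01-01"), as.Date("2020-06-15"))
print(chunks)
```

Each row of `chunks` corresponds to one request the wrapper would make. Calling the wrapper itself would look like `get.nse.ratios('NIFTY 50', as.Date('2020-01-01'), as.Date('2020-06-15'))`, provided the site still serves this endpoint.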