Web scraping using R without using Selenium

Question

I am trying to do web scraping using R, but without Selenium (RSelenium).

I am looking for a way to scrape the table on the page "https://www.icicidirect.com/idirectcontent/Research/TechnicalAnalysis.aspx/balancesheet/tcs".

This is what I have tried:

library(rvest)
Link = 'https://www.icicidirect.com/idirectcontent/Research/TechnicalAnalysis.aspx/balancesheet/tcs'
read_html(Link) %>% html_nodes("#Table1") %>% html_text()
## character(0)

But with this code I get an empty result.

Any pointer in the right direction would be much appreciated.

Answer 1

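The table is not in the HTML you request from the site. It is loaded dynamically by JavaScript on the page via an XHR POST request. You can find this request in the Chrome or Firefox developer tools.

The good news is that you can still get what you want in R by requesting the same URLs that your browser does: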
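To illustrate why the original attempt returns character(0), here is a minimal offline sketch. The HTML string is made up for illustration and is not the real page: it only mimics a page whose table container is empty until a script fills it in.

```r
library(rvest)

# Made-up static HTML (an assumption, not the real page): an empty container
# that a script would populate in the browser after the page loads
static_html <- '<html><body>
  <div id="content"></div>
  <script>/* the browser would request the table here */</script>
</body></html>'

# rvest only sees the static markup, so the selector matches nothing
found <- read_html(static_html) %>% html_nodes("#Table1")

length(found)     # 0
html_text(found)  # character(0)
```

Since rvest does not run JavaScript, any element that is added by a script will be missing in the same way.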

library(httr)
library(rvest)

base_url <- "https://www.icicidirect.com/idirectcontent/"
url1 <- paste0(base_url, "Research/TechnicalAnalysis.aspx/balancesheet/tcs")
url2 <- paste0(base_url, "basemasterpage/ContentDataHandler.ashx?icicicode=TCS")

response_1 <- GET(url1) # This is the page you can't scrape

# Set the parameters for the POST call (found from developer tools)
parameters <- list(pgname = "BalanceSheet_NonBanking",
                   ismethodcall = 0,
                   mthname = "")

# Now post the form and we'll get our table as a response
response_2 <- POST(url2, body = parameters)

# Process it as you did before:
read_html(response_2) %>% html_nodes("#Table1") %>% html_text()
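
If you want the table as a data frame rather than a single string, html_table() may be more convenient than html_text(). The sketch below is self-contained and uses a made-up HTML fragment in place of the real response, so the column names and values are placeholders, not data from the site.

```r
library(rvest)

# Placeholder fragment standing in for the body of response_2 (an assumption);
# the real response will have different rows and columns
fragment <- '<table id="Table1">
  <tr><th>Item</th><th>Value</th></tr>
  <tr><td>Row A</td><td>1</td></tr>
  <tr><td>Row B</td><td>2</td></tr>
</table>'

# Select the single table node and convert it to a data frame
balance <- read_html(fragment) %>%
  html_node("#Table1") %>%
  html_table()

print(balance)
```

With the real data, the same pipeline would start from read_html(response_2) instead of read_html(fragment).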

Tags: r, web-scraping, rvest