Scraping a web page that is only accessible after submitting an aspx form

Question

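I need to scrape a table that is only visible after submitting an aspx form: https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx with "All Years" and "All Surveys" selected. I tried using rvest to get the form, but it does not seem to pick up the form I need: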

require(rvest)
#> Loading required package: rvest
#> Loading required package: xml2

url <- "https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx"

sesh <- html_session(url)

forms <- sesh %>% html_nodes("form") %>% html_form()

forms
#> [[1]]
#> <form> 'HeaderSearch' (GET /search/search_redirect.asp)
#>   <input text> 'Search': Search
#>   <input hidden> 'website': NCES
#>   <input submit> '': Go
#> 
#> [[2]]
#> <form> 'search-box' (GET http://nces.ed.gov/search)
#>   <input hidden> 'output': xml_no_dtd
#>   <input hidden> 'client': nces
#>   <input hidden> 'site': nces
#>   <input hidden> 'sitesearch': nces.ed.gov/ipeds
#>   <input text> 'q': Search IPEDS
#>   <input image> '':

Created on 2020-03-23 by the reprex package (v0.3.0)

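The first list item is the header search bar. The second might be the form, but if it is, it lacks the submit values.

I could use some help figuring out how to simulate that form submission so I can get the table of files, or figuring out whether there is a URL that leads to the same results page.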

Answer 1

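This is tricky, but possible.

The first difficulty you are running into is that when you send a GET request (via html_session) to the url "https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx", you are sending it without any session cookies. This makes the server redirect you to a different page, "https://nces.ed.gov/ipeds/use-the-data", which is the page you are seeing in your variable sesh.

However, since rvest (actually httr, which rvest is built on) re-uses session handles, all you need to do to get around this is navigate to the login page first. That lets httr pick up the session cookies you need to browse as an anonymous user.

Here, we will also set our user agent to Firefox.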

library(httr)
library(rvest)
library(tibble)

url1    <- "https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=8"
url2    <- "https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx"

UA      <- "Mozilla/5.0 (Windows NT 6.1; rv:75.0) Gecko/20100101 Firefox/75.0"

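# Visit the login page first so the shared httr handle picks up the session cookies
html <- GET(url1, user_agent(UA))
# With those cookies in place, this request should no longer be redirected
html <- GET(url2, user_agent(UA))
page <- html %>% read_html()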

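Now page contains the page with the form you want to submit. This is where we run into the second difficulty. The simplest way to send a form is with rvest::submit_form(), but that does not seem to work here, since not all of the fields are complete. We therefore need to build the form manually using rvest's scraping tools: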

form <- list(`__VIEWSTATE` = page %>%
                html_node(xpath = "//input[@name='__VIEWSTATE']") %>%
                html_attr("value"),
             `__VIEWSTATEGENERATOR` = page %>%
                html_node(xpath = "//input[@name='__VIEWSTATEGENERATOR']") %>%
                html_attr("value"),
             `__EVENTVALIDATION` = page %>%
                html_node(xpath = "//input[@name='__EVENTVALIDATION']") %>%
                html_attr("value"),
             `ctl00$contentPlaceHolder$ddlYears` = "-1",
             `ddlSurveys` = "-1",
             `ctl00$contentPlaceHolder$ibtnContinue.x` = sample(50, 1),
             `ctl00$contentPlaceHolder$ibtnContinue.y` = sample(20, 1))
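The three hidden ASP.NET fields above are all read the same way, so the lookup can be wrapped in a small helper. The following is only a sketch: hidden_value() is a hypothetical name, and demo_page is a made-up stand-in document so the helper can be tried without touching the real site.

```r
library(rvest)

# Hypothetical helper: return the value attribute of the <input> with the
# given name, or NA if the page has no such input.
hidden_value <- function(page, name) {
  page %>%
    html_node(xpath = sprintf("//input[@name='%s']", name)) %>%
    html_attr("value")
}

# Made-up stand-in for the real page; the values are placeholders.
demo_page <- read_html('
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123"/>
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789"/>
</form>')

state_fields <- c("__VIEWSTATE", "__VIEWSTATEGENERATOR", "__EVENTVALIDATION")

# Named list of field values, ready to be combined with the other form entries.
demo_values <- lapply(setNames(state_fields, state_fields),
                      function(nm) hidden_value(demo_page, nm))
```

A missing field comes back as NA rather than raising an error, so it may be worth checking for NA values before sending the form.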

We could now submit this form, but before we do, we need to add some headers, without which the server will throw an HTTP 500:

Headers <- add_headers(`Accept-Encoding` = "gzip, deflate, br", 
                       `Accept-Language` = "en-GB,en;q=0.5", 
                       `Connection` = "keep-alive", 
                       `Host` = "nces.ed.gov", 
                       `Origin` = "https://nces.ed.gov", 
                       `Referer` = url2, 
                       `Upgrade-Insecure-Requests` = "1")

Finally, there is a cookie that is normally added via JavaScript, which we need to add manually:

Cookies <- set_cookies(setNames(c(cookies(html)$value, "true"),
                                c(cookies(html)$name, "fromIpeds")))
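To make the setNames() call easier to follow, here is the same merge on a made-up data frame. It assumes only that cookies() returns a data frame with name and value columns, as the code above relies on; the cookie names and values in session_cookies are placeholders, not what the server actually sends.

```r
# Placeholder stand-in for the data frame returned by cookies(html).
session_cookies <- data.frame(
  name  = c("demo_session_id", "demo_flag"),
  value = c("abc", "1"),
  stringsAsFactors = FALSE
)

# Append the extra cookie's value and name, then pair the two vectors up.
cookie_vec <- setNames(
  c(session_cookies$value, "true"),
  c(session_cookies$name, "fromIpeds")
)
```

The result is a named character vector with one element per cookie, which is the shape that set_cookies() receives in the answer's code.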

Now we can POST the form with the correct headers and cookies to get the page you want:

Result  <- POST(url2, body = form, user_agent(UA), Headers, Cookies)

You can now scrape this page as you please. As an example, here is how easily the text of the result table can be scraped:

Result %>% 
 read_html() %>% 
 html_node("#contentPlaceHolder_tblResult") %>% 
 html_table() %>%
 as_tibble()
#> # A tibble: 1,090 x 7
#>     Year Survey    Title        `Data File` `Stata Data Fil~ Programs Dictionary
#>    <int> <chr>     <chr>        <chr>       <chr>            <chr>    <chr>     
#>  1  2018 Institut~ Directory i~ HD2018      HD2018_STATA     SPSS, S~ Dictionary
#>  2  2018 Institut~ Educational~ IC2018      IC2018_STATA     SPSS, S~ Dictionary
#>  3  2018 Institut~ Student cha~ IC2018_AY   IC2018_AY_STATA  SPSS, S~ Dictionary
#>  4  2018 Institut~ Student cha~ IC2018_PY   IC2018_PY_STATA  SPSS, S~ Dictionary
#>  5  2018 Institut~ Response st~ FLAGS2018   FLAGS2018_STATA  SPSS, S~ Dictionary
#>  6  2018 12-Month~ 12-month un~ EFFY2018    EFFY2018_STATA   SPSS, S~ Dictionary
#>  7  2018 12-Month~ 12-month in~ EFIA2018    EFIA2018_STATA   SPSS, S~ Dictionary
#>  8  2018 12-Month~ Response st~ FLAGS2018   FLAGS2018_STATA  SPSS, S~ Dictionary
#>  9  2018 Admissio~ Admission c~ ADM2018     ADM2018_STATA    SPSS, S~ Dictionary
#> 10  2018 Admissio~ Response st~ FLAGS2018   FLAGS2018_STATA  SPSS, S~ Dictionary
#> # ... with 1,080 more rows

Created on 2020-03-31 by the reprex package (v0.3.0)
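Note that html_table() keeps only the cell text, so any links inside the table are dropped. If the file links are what you are after, a sketch along the following lines may help. The HTML here is a made-up stand-in for the result page, and the href values are placeholders, since the real ones are not shown above.

```r
library(rvest)

# Made-up stand-in for the result page; real hrefs will differ.
demo_result <- read_html('
<table id="contentPlaceHolder_tblResult">
  <tr><th>Year</th><th>Data File</th></tr>
  <tr><td>2018</td><td><a href="example/HD2018.zip">HD2018</a></td></tr>
  <tr><td>2018</td><td><a href="example/IC2018.zip">IC2018</a></td></tr>
</table>')

# Select every anchor inside the result table.
links <- demo_result %>%
  html_nodes("#contentPlaceHolder_tblResult a")

# Pair each link's visible text with its target.
link_table <- data.frame(
  text = html_text(links),
  href = html_attr(links, "href"),
  stringsAsFactors = FALSE
)
```

On the real page the same selector would be applied to Result %>% read_html(). If the hrefs turn out to be relative, xml2::url_absolute() can resolve them against the page URL.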

r

rvest