Scrape site that asks for cookies consent with rvest

Question

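I'd like to scrape (using rvest) a website that asks users to consent to setting cookies. If I just scrape the page, rvest only downloads the pop-up. Here is the code: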

library(rvest)
content <- read_html("https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c") 
content %>% html_text()

The result seems to be the content of the pop-up window asking for consent.

Is there a way to ignore or accept the pop-up, or to set a cookie in advance, so that I can access the main text of the site?

Answer 1

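That website isn't static, so I don't think there's a way to scrape it using rvest (I would love to be proved wrong though!). An alternative approach is to use RSelenium to 'click' the pop-up and then scrape the rendered content, e.g.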

library(tidyverse)
library(rvest)
#install.packages("RSelenium")
library(RSelenium)

# Start a Selenium server and open a Firefox session
driver <- rsDriver(browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c")

# Dismiss the cookie consent pop-up by clicking its close button
webElem <- remote_driver$findElement(using = "id", value = "popup_close")
webElem$clickElement()

# Read the text of the rendered job posting (the class name is specific to this page)
out <- remote_driver$findElement(using = "class", value = "css-1nedt8z")
scraped <- out$getElementText()
scraped

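Edit: supporting info concerning the "non-static hypothesis":

If you check how the website is rendered in the browser, you will see that loading the "base document" alone is not sufficient; JavaScript support is required. (Source: Chrome)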
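To illustrate the point with a self-contained sketch: rvest parses the HTML it receives but does not execute JavaScript, so content that a script would insert never appears. The HTML string below is a made-up stand-in for a JavaScript-rendered page, not the actual markup of karriere.nrw.

```r
library(rvest)

# Hypothetical page: the <div> is empty in the raw HTML and would only be
# filled in by the script once a browser runs it.
raw_html <- paste0(
  '<html><body>',
  '<div id="root"></div>',
  '<script>document.getElementById("root").textContent = "Job description";</script>',
  '</body></html>'
)

page <- read_html(raw_html)

# rvest sees the static markup only, so the container has no text
root_text <- page %>% html_element("#root") %>% html_text()
root_text
```

If the same check on a real page returns an empty or near-empty container, that is a hint the content is rendered client-side.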

Answer 2

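As already suggested, the website is dynamic, which means it is built by JavaScript. Reconstructing from the .js files how this is done is usually very time-consuming (or outright impossible), but in this case you can actually see in your browser's "network analysis" feature that there is a non-hidden API that serves the information you want. It is the request to api.karriere.nrw.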

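Hence you can take the UUID from the URL (the identifier in the database) and make a simple GET request to the API, going straight to the source without rendering through RSelenium, which costs extra time and resources.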

Be friendly though, and send them some way to contact you, so they can tell you to stop.

library(tidyverse)
library(httr)
library(rvest)
library(jsonlite)
headers <- c("Email" = "johndoe@company.com")

### assuming the url is given and always has the same format
url <- "https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c"

### extract identifier of job posting
uuid <- str_split(url,"/")[[1]][5]

### make api call-address
api_url <- str_c("https://api.karriere.nrw/v1.0/stellenausschreibungen/",uuid)

### get results
response <- httr::GET(api_url,
                    httr::add_headers(.headers = headers))
result <- httr::content(response, as = "text") %>% jsonlite::fromJSON()
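The code above takes the fifth element after splitting on "/", which assumes the URL always has exactly this shape. A slightly more defensive sketch is shown below; `build_api_url` is a hypothetical helper name, and the API base path is simply the one used above, assumed to stay stable. It strips any query string or trailing slash, takes the last path segment, and checks that it looks like a UUID before building the API address.

```r
# Hypothetical helper: derive the API address from a job posting URL.
build_api_url <- function(posting_url,
                          api_base = "https://api.karriere.nrw/v1.0/stellenausschreibungen/") {
  # Drop query string / fragment, then any trailing slashes
  clean <- sub("[?#].*$", "", posting_url)
  clean <- sub("/+$", "", clean)

  # The identifier is the last path segment
  uuid <- basename(clean)

  # Fail early if the segment does not look like a UUID
  uuid_pattern <- "^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
  if (!grepl(uuid_pattern, uuid, ignore.case = TRUE)) {
    stop("No UUID found at the end of the URL: ", posting_url)
  }

  paste0(api_base, uuid)
}

api_url <- build_api_url(
  "https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c"
)
api_url
```

The resulting `api_url` can be passed to `httr::GET()` as above; calling `httr::stop_for_status(response)` before parsing would surface HTTP errors instead of trying to parse an error page as JSON.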
