In trouble with rvest

Question

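I am trying to scrape HTML or JSON files from a website that lists economists from around the world. Here is an example of a page I am trying to use: https://ideas.repec.org/f/pan296.html

More precisely, I am trying to scrape the data displayed when you click the "Export references" button, in JSON, HTML, or any other format. Here is what I did: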

  test <- rvest::html_session("https://ideas.repec.org/f/pan296.html") %>% jump_to("https://ideas.repec.org/cgi-bin/refs.cgi")
  test$response

The connection works, but the output is empty:

Response [https://ideas.repec.org/cgi-bin/refs.cgi]
  Date: 2020-07-13 08:50
  Status: 200
  Content-Type: text/plain; charset=utf-8
<EMPTY BODY>

Any ideas?

Answer 1

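As Aziz said, you have to observe the POST request in order to reconstruct it. In this case that can be tricky, because the request is opened in a new tab. See this thread on how to observe requests that open in a new tab: Chrome Dev Tools: How to trace network for a link that opens a new tab?

Code to get the exported content: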

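library(rvest)

# Open a session on the author's page
url <- "https://ideas.repec.org/f/pan296.html"
pg <- html_session(url)

# Read the value of the "handle" input from the export form
handle_value <- pg %>% html_node(xpath = "//form/input[@name='handle']") %>% html_attr("value")

# Rebuild the POST request sent by the "Export references" button
# (request_POST is an unexported rvest function, hence the triple colon)
pg <- pg %>% rvest:::request_POST(url = "https://ideas.repec.org/cgi-bin/refs.cgi",
                                  body = list("handle"= handle_value,
                                              "ref" = "Export references ",
                                              "output" = "0"))

# Inspect the response
pg$response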

(Change the value of output to get a different output format; 0 is for HTML.)
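Since request_POST is an internal rvest function, the same request could probably also be sent with httr and xml2 directly. The following is only a sketch under assumptions, not something tested against the site: the helper names build_export_body and fetch_export are made up for illustration, the form fields are copied from the code above, and I am assuming the server does not need anything beyond what httr sends by default.

```r
library(httr)
library(xml2)

# Build the form fields for the export request.
# `handle` is the value of the "handle" input on the author page;
# `output` selects the export format ("0" is HTML, per the note above).
build_export_body <- function(handle, output = "0") {
  list(
    handle = handle,
    ref    = "Export references ",  # button value, copied from the answer's code
    output = as.character(output)
  )
}

# Fetch the exported references for one author page.
# This performs network requests, so it is only defined here, not called.
fetch_export <- function(page_url, output = "0") {
  # Load the author page and read the handle from the export form
  page_resp <- GET(page_url)
  stop_for_status(page_resp)
  page <- read_html(content(page_resp, as = "text", encoding = "UTF-8"))
  handle <- xml_attr(
    xml_find_first(page, "//form/input[@name='handle']"),
    "value"
  )

  # Send the POST request; a list body is sent as multipart by default in httr
  resp <- POST(
    "https://ideas.repec.org/cgi-bin/refs.cgi",
    body = build_export_body(handle, output)
  )
  stop_for_status(resp)
  content(resp, as = "text", encoding = "UTF-8")
}

# Usage (needs network access):
# txt <- fetch_export("https://ideas.repec.org/f/pan296.html", output = "0")
# cat(substr(txt, 1, 500))
```

Keeping the field list in its own small function makes it easy to try other output values without touching the request code.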

r

web-scraping

rvest