Scraping a website with rvest: "Current page doesn't appear to be html."
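I am trying to access this website: https://www.apa.org/pubs/journals/browse?query=Title:*&type=journal
However, I get the error message: "Current page doesn't appear to be html."
As a result, I cannot go on to scrape the site with html_nodes and similar functions.
Here is my code: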
apa_url <- "https://www.apa.org/pubs/journals/browse?query=Title:*&type=journal"
apa_page <- rvest::html_session(
  apa_url,
  httr::user_agent("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20")
)
If you know how to fix this, I would really appreciate the help!
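You have not said what you want to scrape, but you do not need to create a session.
For example, to get the journal titles on the first page, you can do this: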
library(rvest)
apa_url <- "https://www.apa.org/pubs/journals/browse?query=Title:*&type=journal"
apa_url %>%
read_html() %>%
html_nodes('section.sresults li a') %>%
html_text()
# [1] "American Journal of Orthopsychiatry - APA Publishing | APA"
# [2] "American Psychologist Journal - APA Publishing | APA"
# [3] "Archives of Scientific Psychology"
# [4] "Asian American Journal of Psychology"
# [5] "Behavior Analysis: Research and Practice"
# [6] "Behavioral Development"
# ...
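If a custom user agent is still wanted, one option is to make the request with httr and hand the response body to read_html(). The following is only a minimal sketch: extract_titles() and fetch_titles() are hypothetical helper names, not part of rvest or httr, and the selector is the one used above, so it assumes the page markup has not changed.

```r
library(httr)
library(rvest)

apa_url <- "https://www.apa.org/pubs/journals/browse?query=Title:*&type=journal"

# Hypothetical helper: pull the link text out of the results list.
# Assumes the titles still sit under 'section.sresults li a'.
extract_titles <- function(page) {
  page %>%
    html_nodes("section.sresults li a") %>%
    html_text(trim = TRUE)
}

# Hypothetical helper: request the page with an explicit user agent,
# fail loudly on HTTP errors, then parse the body as HTML.
fetch_titles <- function(url, ua = "Mozilla/5.0") {
  resp <- GET(url, user_agent(ua))
  stop_for_status(resp)
  body <- content(resp, as = "text", encoding = "UTF-8")
  extract_titles(read_html(body))
}

# Usage (needs network access):
# titles <- fetch_titles(apa_url)
# head(titles)
```

Keeping the parsing step in its own function means the selector can be checked against a saved copy of the page without hitting the site again.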