Web Scraping with rvest and xml2

Question

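I am trying to scrape the dates and policy types of COVID-related announcements from this URL: https://covid19.healthdata.org/united-states-of-america/alabama

The first date I want to extract is "April 4th, 2020", the date of Alabama's stay-at-home order.

As far as I can tell (I am new to this), it has the XPath:

 "//*[@id="root"]/div/main/div[3]/div[1]/div[2]/div[1]/div[1]/div/div/span"

I have been using the following lines to try to retrieve it: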

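library(rvest)
url <- "https://covid19.healthdata.org/united-states-of-america/alabama"

# Attempt 1: CSS selector
data <- read_html(url) %>%
  html_nodes("span.ant-statistic-content-value")

# Attempt 2: XPath
data <- read_html(url) %>%
  html_nodes(xpath = "//*[@id='root']/div/main/div[3]/div[1]/div[2]/div[1]/div[1]/div/div/span")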

Neither approach extracts the information I am looking for. Any help would be greatly appreciated!

Answer 1

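The data for this page is stored in a series of JSON files. If you open the developer tools in your browser and filter the network requests by type XHR, you should see a list of the JSON files the page requests.

Right-click a file name to copy its URL.

This script should get you started: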
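This also suggests why the selectors in the question come back empty: the values are filled in by JavaScript after the page loads, so the HTML that `read_html()` downloads likely contains little more than an empty container. The sketch below illustrates the effect offline with xml2. The `static_html` string is a hypothetical stand-in for what the server returns, not a copy of the real page.

```r
library(xml2)

# Hypothetical stand-in for the static HTML served before any JavaScript runs:
# the #root container exists but has no children yet.
static_html <- '<html><body><div id="root"></div></body></html>'
page <- read_html(static_html)

# The class-based selector from the question, expressed as XPath
spans <- xml_find_all(
  page,
  "//span[contains(@class, 'ant-statistic-content-value')]"
)

# No matching nodes: there is nothing to extract until a browser fills in #root
length(spans)
```

Because no span exists in the static markup, no CSS selector or XPath can match it, which is why reading the JSON files directly is the simpler route.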

library(jsonlite)
#obtain the list of locations
locations<-fromJSON("https://covid19.healthdata.org/api/metadata/location?v=7", flatten = TRUE)

head(locations[, 1:9])
#get list of US locations
US <- locations$children[locations$location_name =="United States of America"]
head(US[[1]])

#Get a data frame of interventions
#Build the link with the desired location_id (569 is Virginia), for example:
#paste0("https://covid19.healthdata.org/api/data/intervention?location=", "569")
Interventions <- fromJSON("https://covid19.healthdata.org/api/data/intervention?location=569", flatten = TRUE)

Interventions
# date_reported covid_intervention_id location_id covid_intervention_measure_id   covid_intervention_measure_name
# 1 2020-03-30 00:00:00                   110         569                             1 People instructed to stay at home
# 2 2020-03-16 00:00:00                   258         569                             2     Educational facilities closed
# 3 2020-04-19 00:00:00                   437         569                             7          Assumed_implemented_date

#Repeat for other links of interest
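
To repeat the request for other locations and keep only the date and policy type asked about in the question, the script above could be wrapped in two small helpers. This is a minimal sketch: `intervention_url()` and `tidy_interventions()` are hypothetical names, the endpoint is the one used above and may have changed since, and the sample rows are copied from the Virginia output shown above so the example runs without network access.

```r
# Endpoint taken from the script above; it may no longer be available.
base_url <- "https://covid19.healthdata.org/api/data/intervention?location="

# Hypothetical helper: build the request URL for one location_id.
intervention_url <- function(location_id) {
  paste0(base_url, location_id)
}

# Hypothetical helper: keep the date and policy name, parsing the date
# from strings such as "2020-03-30 00:00:00".
tidy_interventions <- function(df) {
  data.frame(
    date = as.Date(substr(df$date_reported, 1, 10)),
    policy = df$covid_intervention_measure_name,
    stringsAsFactors = FALSE
  )
}

# Offline check with the first two rows of the Virginia output shown above
sample_rows <- data.frame(
  date_reported = c("2020-03-30 00:00:00", "2020-03-16 00:00:00"),
  covid_intervention_measure_name = c(
    "People instructed to stay at home",
    "Educational facilities closed"
  ),
  stringsAsFactors = FALSE
)
tidy_interventions(sample_rows)

# Live request (needs network access and the jsonlite package):
# tidy_interventions(jsonlite::fromJSON(intervention_url(569), flatten = TRUE))
```

The location_id for Alabama would need to be looked up in the locations table returned by the metadata request above.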
