Web Scraping with rvest and xml2
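I am trying to scrape the dates and policy types of COVID-related announcements from this URL: https://covid19.healthdata.org/united-states-of-america/alabama
The first date I want to extract is "April 4th, 2020", the date of Alabama's stay-at-home order.
As far as I can tell (I'm new to this), it has the XPath:
"//*[@id="root"]/div/main/div[3]/div[1]/div[2]/div[1]/div[1]/div/div/span"
I have been using the following lines to try to retrieve it: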
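library(rvest)
url <- "https://covid19.healthdata.org/united-states-of-america/alabama"
# Attempt 1: CSS selector
data <- read_html(url) %>%
  html_nodes("span.ant-statistic-content-value")
# Attempt 2: XPath
data <- read_html(url) %>%
  html_nodes(xpath = "//*[@id='root']/div/main/div[3]/div[1]/div[2]/div[1]/div[1]/div/div/span")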
Neither extracts the information I am looking for. Any help would be appreciated!
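The data for this page is stored in a series of JSON files that the browser loads separately, which is presumably why the values are not in the HTML that read_html() retrieves. If you open your browser's developer tools and filter the network requests by type XHR, you should see a list of these files (I used Safari).
Right-click a name in the list to copy its URL.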
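To see why the selectors come back empty, it can help to look at what a JavaScript-rendered page looks like before any script has run. The snippet below is only a sketch: `static_html` is a hypothetical stand-in I made up for such a page shell, not the actual HTML served by covid19.healthdata.org.

```r
library(rvest)

# Hypothetical page shell: an empty "root" container that a script would fill in later.
static_html <- '<html><body><div id="root"></div><script src="/static/js/main.js"></script></body></html>'

# read_html() parses the literal string; no JavaScript is executed.
page <- read_html(static_html)

# The same CSS selector as in the question finds nothing in the shell.
spans <- html_nodes(page, "span.ant-statistic-content-value")
length(spans)
```

If running `length()` on the result of either attempt in the question also gives 0, the values are most likely being filled in by a script after the page loads, and the JSON files are the better source.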
This script should get you started:
library(jsonlite)
# Obtain the list of locations
locations <- fromJSON("https://covid19.healthdata.org/api/metadata/location?v=7", flatten = TRUE)
head(locations[, 1:9])
# Get the list of US locations
US <- locations$children[locations$location_name == "United States of America"]
head(US[[1]])
# Get a data frame of interventions
# Build the link with the desired location_id (569 is Virginia), e.g.
# paste0("https://covid19.healthdata.org/api/data/intervention?location=", "569")
Interventions <- fromJSON("https://covid19.healthdata.org/api/data/intervention?location=569", flatten = TRUE)
Interventions
#         date_reported covid_intervention_id location_id covid_intervention_measure_id   covid_intervention_measure_name
# 1 2020-03-30 00:00:00                   110         569                             1 People instructed to stay at home
# 2 2020-03-16 00:00:00                   258         569                             2     Educational facilities closed
# 3 2020-04-19 00:00:00                   437         569                             7          Assumed_implemented_date
# Repeat for other links of interest
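To repeat this for other locations, the steps above could be wrapped in a few small helpers. This is a sketch rather than a tested client for the site: `lookup_location_id`, `intervention_url`, `tidy_interventions` and `fetch_interventions` are hypothetical names, and I am assuming the `children` data frame has `location_id` and `location_name` columns like the top-level table. The sample rows are copied from the output above so the helpers can be checked without network access.

```r
library(jsonlite)

# Hypothetical helper: find a location_id by name in a data frame such as US[[1]].
# Assumes columns named location_id and location_name.
lookup_location_id <- function(children, name) {
  children$location_id[children$location_name == name]
}

# Hypothetical helper: build the intervention URL for one location_id.
intervention_url <- function(location_id) {
  paste0("https://covid19.healthdata.org/api/data/intervention?location=",
         as.integer(location_id))
}

# Hypothetical helper: keep only the date and policy type, with dates parsed as Date.
tidy_interventions <- function(df) {
  data.frame(
    date   = as.Date(substr(df$date_reported, 1, 10)),
    policy = df$covid_intervention_measure_name,
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: download and tidy one location (needs network access).
fetch_interventions <- function(location_id) {
  tidy_interventions(fromJSON(intervention_url(location_id), flatten = TRUE))
}

# Offline check using values taken from the answer above.
sample_children <- data.frame(location_id = 569L, location_name = "Virginia",
                              stringsAsFactors = FALSE)
sample_rows <- data.frame(
  date_reported = c("2020-03-30 00:00:00", "2020-03-16 00:00:00"),
  covid_intervention_measure_name = c("People instructed to stay at home",
                                      "Educational facilities closed"),
  stringsAsFactors = FALSE
)

id <- lookup_location_id(sample_children, "Virginia")
intervention_url(id)
tidy_interventions(sample_rows)
```

With network access, `fetch_interventions(id)` should give the date and policy columns for that location, and `do.call(rbind, lapply(ids, fetch_interventions))` would stack several locations, where `ids` is a vector of location ids you have looked up.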