Scraping using rvest through multiple pages
I'm trying to work out how to adapt the code below to scrape 100+ pages of search results. I've looked at other SO questions and it seems 'purrr' may be the best library for this, but I've struggled to get any of the suggestions working. The change.org url appears to vary by "&offset=10", "&offset=20", and so on.
Any suggestions, or other SO questions I may have missed, would be much appreciated.
library(rvest)
#Specifying the url for website to be scraped
url <- 'https://www.change.org/search?q=PPE'
#Reading the HTML code from the website
webpage <- read_html(url)
#Using CSS selectors to scrape HEADING & convert to text
#Using CSS selectors to scrape HEADING & convert to text
title_data <- html_nodes(webpage, '.xs-mbs') %>%
  html_text(trim = TRUE)
#Using CSS selectors to scrape DATE & convert to text / nb. trim = TRUE
date_data <- html_nodes(webpage, '.symbol-clock+ span') %>%
  html_text(trim = TRUE)
#Remove "Created" (not sure how to remove leading whitespace!!)
date_data <- gsub("Created", "", date_data)
#Using CSS selectors to scrape SUPPORTER NUMBERS & convert to text
supporters_data <- html_nodes(webpage, '.symbol-supporters+ span') %>%
  html_text(trim = TRUE)
#remove SPACE & 'supporters'
supporters_data <- gsub(" supporters", "", supporters_data)
#Using CSS selectors to scrape PETITION TO/ADDRESSEE & convert to text
addressee_data <- html_nodes(webpage, '.xs-mbn .type-s') %>%
  html_text(trim = TRUE)
#remove "Petition to"
addressee_data <- gsub("Petition to ", "", addressee_data)
#Using CSS selectors to scrape LOCATION & convert to text
location_data <- html_nodes(webpage, '.plxxs') %>%
  html_text(trim = TRUE)
#Combining all the lists to form a data frame
change_df <- data.frame(Title = title_data, Addressee = addressee_data, Date = date_data,
                        Supporters = supporters_data)
You can take a functional approach here: build a helper function that scrapes a single page, then simply apply that same function to a vector of urls to scrape all the pages. Here is a small function that replicates your code:
scrape_change_page <- function(url)
{
  webpage <- xml2::read_html(url)

  # Pull the text for a CSS selector; pad with "" up to 10 entries so every
  # column has one value per petition on a results page, even when a field
  # is missing for some petitions
  get_text <- function(css)
  {
    vec <- rvest::html_text(rvest::html_nodes(webpage, css), trim = TRUE)
    if(length(vec) < 10) c(vec, rep("", 10 - length(vec))) else vec
  }

  dplyr::tibble(
    title_data      = get_text('.xs-mbs'),
    date_data       = gsub("Created", "", get_text('.symbol-clock+ span')),
    supporters_data = gsub(" supporters", "", get_text('.symbol-supporters+ span')),
    addressee_data  = gsub("Petition to ", "", get_text('.xs-mbn .type-s')),
    location_data   = get_text('.plxxs')
  )
}
So if we test it on your url, we get a nice data frame (it prints as a tibble here purely for formatting purposes):
url <- 'https://www.change.org/search?q=PPE'
scrape_change_page(url)
#> # A tibble: 10 x 5
#> title_data date_data supporters_data addressee_data location_data
#> <chr> <chr> <chr> <chr> <chr>
#> 1 PPE FOR INDIAN~ 22 Mar 2~ 31,521 Government of Indi~ India
#> 2 Personal prote~ 27 Mar 2~ 992,282 Matthew Hancock MP~ Liverpool, EN~
#> 3 Government mus~ 3 May 20~ 208,747 UK Government London, ENG, ~
#> 4 PPE for NHS st~ 19 Mar 2~ 48,536 Matt Hancock Woking, ENG, ~
#> 5 URGENT NEED fo~ 28 Mar 2~ 36,773 Boris Johnson, Mat~ Hebburn, ENG,~
#> 6 PPE for all fr~ 1 Apr 20~ 12,963 Public Health Engl~ Hartlepool, E~
#> 7 Appropriate PP~ 29 Mar 2~ 4,996 Government United Kingdom
#> 8 PPE for all Tf~ 8 Apr 20~ 11,850 Transport For Lond~ Ilford, ENG, ~
#> 9 Mandate PPE fo~ 22 Apr 2~ 11,068 Kay Ivey, Michael ~ Anoka, MN, US
#> 10 Effective PPE ~ 28 Mar 2~ 24,588 Ministry of Health Christchurch,~
Now, if we want to put multiple pages into a single data frame, we generate a vector of the urls we want, lapply this function over them to get a list of data frames, and then rbind those into one big data frame. There appear to be 191 pages to scrape, so for this example I'll only scrape 5:
n_pages <- 5
urls <- paste0(url, "&offset=", 10 * (seq(n_pages) - 1))
result <- do.call(rbind, lapply(urls, scrape_change_page))
result
#> # A tibble: 50 x 5
#> title_data date_data supporters_data addressee_data location_data
#> <chr> <chr> <chr> <chr> <chr>
#> 1 PPE FOR INDIAN~ 22 Mar 2~ 31,521 Government of Indi~ India
#> 2 Government mus~ 3 May 20~ 208,747 UK Government London, ENG, ~
#> 3 Personal prote~ 27 Mar 2~ 992,282 Matthew Hancock MP~ Liverpool, EN~
#> 4 PPE for NHS st~ 19 Mar 2~ 48,536 Matt Hancock Woking, ENG, ~
#> 5 PPE for all fr~ 1 Apr 20~ 12,963 Public Health Engl~ Hartlepool, E~
#> 6 Appropriate PP~ 29 Mar 2~ 4,996 Government United Kingdom
#> 7 URGENT NEED fo~ 28 Mar 2~ 36,773 Boris Johnson, Mat~ Hebburn, ENG,~
#> 8 PPE for all Tf~ 8 Apr 20~ 11,850 Transport For Lond~ Ilford, ENG, ~
#> 9 Mandate PPE fo~ 22 Apr 2~ 11,068 Kay Ivey, Michael ~ Anoka, MN, US
#> 10 Effective PPE ~ 28 Mar 2~ 24,588 Ministry of Health Christchurch,~
#> # ... with 40 more rows
You can see that result has 50 rows (10 per page).
Created on 2020-07-08 by the reprex package (v0.3.0)
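Since you mention purrr: the lapply() / do.call(rbind, ...) step above can equivalently be written with purrr::map_dfr(), which applies the scraping function to each url and row-binds the resulting tibbles in one call. A minimal sketch, assuming the scrape_change_page() helper defined above:

library(purrr)

n_pages <- 5                                             # increase towards 191 to scrape every page
urls <- paste0(url, "&offset=", 10 * (seq(n_pages) - 1))

# apply scrape_change_page() to each url and row-bind the tibbles it returns
result <- map_dfr(urls, scrape_change_page)

If you do scrape all 191 pages, it may be worth adding a short Sys.sleep() between requests so you don't overload the site.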