Using rvest to scrape text and a table, and combine the two across multiple pages
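I need to scrape several tables across different URLs. I managed to scrape a single page, but my function fails when I try to scrape across pages and stack the tables into a data frame or list.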
library(rvest)
library(tidyverse)
library(purrr)
index <-225:227
urls <- paste0("https://lsgkerala.gov.in/en/lbelection/electdmemberdet/2010/", index)
get_gram <- function(url){
  urls %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/a[2]') %>%
    html_text() -> temp
  urls %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/table') %>%
    html_table() %>%
    as.data.frame() %>% add_column(newcol=str_c(temp))
}
#results <- map_df(urls,get_gram) Have commented this out, but this is what i
# used to get the table when the index just had one element and it worked.
results <- list()
results[[i]] <- map_df(urls,get_gram)
I think I'm stumbling at the step where the map_df output has to be stacked. Thanks in advance for your help!
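You pass url to the function but use urls inside the function body, so every call tries to read all of the URLs at once instead of the one it was given. Try this version: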
library(rvest)
library(dplyr)
index <-225:227
urls <- paste0("https://lsgkerala.gov.in/en/lbelection/electdmemberdet/2010/", index)
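get_gram <- function(url){
  # Read the page once and reuse it for both queries
  webpage <- url %>% read_html()
  # Title text taken from the second link in the content block
  temp <- webpage %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/a[2]') %>%
    html_text()
  # The table, with the title attached as a new column
  # (mutate() comes from dplyr; add_column() would need library(tibble))
  webpage %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/table') %>%
    html_table() %>%
    as.data.frame() %>%
    mutate(newcol = temp)
}
# map_df() calls get_gram() on each URL and row-binds the results
result <- purrr::map_df(urls,get_gram)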
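To see the url/urls mix-up in isolation, here is a minimal sketch that needs no network access. The example.org addresses and the n_chars_* helpers are made up for illustration; they only measure the URL strings as a stand-in for real scraping.

```r
# Hypothetical URLs; nothing is downloaded in this sketch.
urls <- c("https://example.org/a", "https://example.org/bb")

# Buggy: the body reads the global `urls`, so the argument `url` is ignored
# and every call works on the whole vector.
n_chars_buggy <- function(url) nchar(urls)

# Fixed: the body uses its own argument, so each call handles one URL.
n_chars_fixed <- function(url) nchar(url)

buggy <- lapply(urls, n_chars_buggy)
fixed <- lapply(urls, n_chars_fixed)

lengths(buggy)  # one value per URL in *every* call
lengths(fixed)  # a single value per call
```

With a single URL the two versions behave the same, which is presumably why the original function appeared to work until the index had more than one element.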
Consider this approach. We only need html_node, because your code suggests there is only one table to scrape per page.
library(tidyverse)
library(rvest)
get_title <- . %>% html_node(xpath = '//*[@id="block-zircon-content"]/a[2]') %>% html_text()
get_table <- . %>% html_node(xpath = '//*[@id="block-zircon-content"]/table') %>% html_table()
urls <- paste0("https://lsgkerala.gov.in/en/lbelection/electdmemberdet/2010/", 225:227)
tibble(urls) %>%
  mutate(
    page = map(urls, read_html),
    newcol = map_chr(page, get_title),
    data = map(page, get_table),
    page = NULL, urls = NULL
  ) %>%
  unnest(data)
Output
# A tibble: 52 x 7
newcol `Ward No.` `Ward Name` `Elected Members` Role Party Reservation
<chr> <int> <chr> <chr> <chr> <chr> <chr>
1 Thiruvananthapuram - Chemmaruthy Grama Panchayat 1 VANDIPPURA BABY P Member CPI(M) Woman
2 Thiruvananthapuram - Chemmaruthy Grama Panchayat 2 PALAYAMKUNNU SREELATHA D Member INC Woman
3 Thiruvananthapuram - Chemmaruthy Grama Panchayat 3 KOVOOR KAVITHA V Member INC Woman
4 Thiruvananthapuram - Chemmaruthy Grama Panchayat 4 SIVAPURAM ANIL. V Member INC General
5 Thiruvananthapuram - Chemmaruthy Grama Panchayat 5 MUTHANA JAYALEKSHMI S Member INC Woman
6 Thiruvananthapuram - Chemmaruthy Grama Panchayat 6 MAVINMOODU S SASIKALA NATH Member CPI(M) Woman
7 Thiruvananthapuram - Chemmaruthy Grama Panchayat 7 NJEKKADU P.MANILAL Member INC General
8 Thiruvananthapuram - Chemmaruthy Grama Panchayat 8 CHEMMARUTHY SASEENDRA President INC Woman
9 Thiruvananthapuram - Chemmaruthy Grama Panchayat 9 PANCHAYAT OFFICE PRASANTH PANAYARA Member INC General
10 Thiruvananthapuram - Chemmaruthy Grama Panchayat 10 VALIYAVILA SANJAYAN S Member INC General
# ... with 42 more rows
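If you want to try the list-column pattern without hitting the site, the following is a rough offline sketch. The make_page() helper, the "content" id, and the panchayat and ward names are all invented for illustration; they only mimic the assumed layout of a title link followed by one table.

```r
library(rvest)
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical helper: builds an HTML string with two links and one table.
make_page <- function(title, wards) {
  rows <- paste0("<tr><td>", seq_along(wards), "</td><td>", wards, "</td></tr>",
                 collapse = "")
  paste0(
    '<html><body><div id="content">',
    '<a href="#">Home</a><a href="#">', title, '</a>',
    '<table><tr><th>Ward No.</th><th>Ward Name</th></tr>', rows, '</table>',
    '</div></body></html>'
  )
}

# read_html() also parses a literal HTML string, so no download happens here.
pages <- list(
  read_html(make_page("Panchayat A", c("NORTH", "SOUTH"))),
  read_html(make_page("Panchayat B", c("EAST", "WEST", "CENTRE")))
)

get_title <- . %>% html_node(xpath = '//*[@id="content"]/a[2]') %>% html_text()
get_table <- . %>% html_node(xpath = '//*[@id="content"]/table') %>% html_table()

# One row per page, then unnest() stacks the per-page tables.
stacked <- tibble(page = pages) %>%
  mutate(
    newcol = map_chr(page, get_title),
    data = map(page, get_table),
    page = NULL
  ) %>%
  unnest(data)
```

Each page's title is repeated on every row of that page's table, which is the same shape as the output above.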