Using rvest to scrape text, table, and combine the two from multiple pages

Question

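I have a situation where I want to scrape multiple tables across different URLs. I did manage to scrape one page, but my function fails when I try to scrape across pages and stack the tables as a dataframe/list.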

library(rvest)
library(tidyverse)
library(purrr)

index <- 225:227
urls <- paste0("https://lsgkerala.gov.in/en/lbelection/electdmemberdet/2010/", index)

get_gram <- function(url){
  urls %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/a[2]') %>%
    html_text() -> temp
  urls %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/table') %>%
    html_table() %>%
    as.data.frame() %>% add_column(newcol=str_c(temp))
}
# results <- map_df(urls, get_gram)
# Commented out, but this is what I used to get the table when the index
# had just one element, and it worked.

results <- list()
results[[i]] <- map_df(urls,get_gram)

I think I am stumbling at the step where I have to stack the map_df output. Thanks in advance for your help!

Answer 1

You pass url to the function but use urls (the whole vector) in the function body. Try this version:

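library(rvest)
library(dplyr)
library(tibble)  # provides add_column()

index <- 225:227
urls <- paste0("https://lsgkerala.gov.in/en/lbelection/electdmemberdet/2010/", index)

get_gram <- function(url){
  # Read the page once and reuse the parsed document
  webpage <- url %>% read_html()
  # Text of the second link in the content block
  webpage %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/a[2]') %>%
    html_text() -> temp
  # Table on the page, with that text attached as a new column
  webpage %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/table') %>%
    html_table() %>%
    as.data.frame() %>% add_column(newcol=temp)
}
# Apply get_gram to every URL and row-bind the results
result <- purrr::map_df(urls,get_gram)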
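
To try the per-page pattern without hitting the site, here is a minimal offline sketch. The helper make_page, the function parse_page, and the page contents are hypothetical stand-ins invented for illustration; only the id and the XPath expressions are taken from the code above. It assumes rvest, purrr, tibble, and dplyr are installed.

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical helper: builds a small HTML string shaped like the real pages
# (a content block with two links followed by one table).
make_page <- function(title, wards) {
  rows <- paste0("<tr><td>", seq_along(wards), "</td><td>", wards, "</td></tr>",
                 collapse = "")
  paste0(
    '<html><body><div id="block-zircon-content">',
    '<a href="#">Home</a><a href="#">', title, '</a>',
    '<table><tr><th>Ward No.</th><th>Ward Name</th></tr>', rows, '</table>',
    '</div></body></html>'
  )
}

# Made-up pages standing in for the three URLs
pages <- list(
  make_page("Panchayat A", c("NORTH", "SOUTH")),
  make_page("Panchayat B", c("EAST", "WEST", "CENTRAL"))
)

# The function works on its own argument only, never on the outer vector
parse_page <- function(html) {
  doc <- read_html(html)  # a string containing markup is parsed directly
  title <- doc %>%
    html_node(xpath = '//*[@id="block-zircon-content"]/a[2]') %>%
    html_text()
  doc %>%
    html_node(xpath = '//*[@id="block-zircon-content"]/table') %>%
    html_table() %>%
    add_column(newcol = title)
}

# One data frame per page, row-bound into a single result
result <- map_dfr(pages, parse_page)
```

With these made-up pages, result should hold the rows of both tables stacked, with newcol identifying the page each row came from.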

Answer 2

Consider this approach. We only need html_node, because your code suggests there is only one table to scrape per page.

library(tidyverse)
library(rvest)

get_title <- . %>% html_node(xpath = '//*[@id="block-zircon-content"]/a[2]') %>% html_text()
get_table <- . %>% html_node(xpath = '//*[@id="block-zircon-content"]/table') %>% html_table()

urls <- paste0("https://lsgkerala.gov.in/en/lbelection/electdmemberdet/2010/", 225:227)

tibble(urls) %>% 
  mutate(
    page = map(urls, read_html), 
    newcol = map_chr(page, get_title), 
    data = map(page, get_table), 
    page = NULL, urls = NULL
  ) %>% 
  unnest(data)

Output

# A tibble: 52 x 7
   newcol                                           `Ward No.` `Ward Name`      `Elected Members` Role      Party  Reservation
   <chr>                                                 <int> <chr>            <chr>             <chr>     <chr>  <chr>      
 1 Thiruvananthapuram - Chemmaruthy Grama Panchayat          1 VANDIPPURA       BABY P            Member    CPI(M) Woman      
 2 Thiruvananthapuram - Chemmaruthy Grama Panchayat          2 PALAYAMKUNNU     SREELATHA D       Member    INC    Woman      
 3 Thiruvananthapuram - Chemmaruthy Grama Panchayat          3 KOVOOR           KAVITHA V         Member    INC    Woman      
 4 Thiruvananthapuram - Chemmaruthy Grama Panchayat          4 SIVAPURAM        ANIL. V           Member    INC    General    
 5 Thiruvananthapuram - Chemmaruthy Grama Panchayat          5 MUTHANA          JAYALEKSHMI S     Member    INC    Woman      
 6 Thiruvananthapuram - Chemmaruthy Grama Panchayat          6 MAVINMOODU       S SASIKALA NATH   Member    CPI(M) Woman      
 7 Thiruvananthapuram - Chemmaruthy Grama Panchayat          7 NJEKKADU         P.MANILAL         Member    INC    General    
 8 Thiruvananthapuram - Chemmaruthy Grama Panchayat          8 CHEMMARUTHY      SASEENDRA         President INC    Woman      
 9 Thiruvananthapuram - Chemmaruthy Grama Panchayat          9 PANCHAYAT OFFICE PRASANTH PANAYARA Member    INC    General    
10 Thiruvananthapuram - Chemmaruthy Grama Panchayat         10 VALIYAVILA       SANJAYAN S        Member    INC    General    
# ... with 42 more rows
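
The html_node versus html_nodes point in this answer can be checked offline with a small sketch. The HTML string below is invented for illustration and is not taken from the site.

```r
library(rvest)

# A made-up document containing exactly one table
doc <- read_html(
  '<div id="content"><table><tr><th>x</th></tr><tr><td>1</td></tr></table></div>'
)

# html_nodes() returns a node set, so html_table() gives a list of tables
many <- doc %>% html_nodes("table") %>% html_table()

# html_node() returns a single node, so html_table() gives one data frame
one <- doc %>% html_node("table") %>% html_table()

is.data.frame(many)  # FALSE, it is a list holding one table
is.data.frame(one)   # TRUE
```

This is why the html_nodes version needs as.data.frame() to unwrap the list, while the html_node version can feed its result straight into unnest().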
