Extracting all links from a webpage and storing them in a data frame using rvest

Question

I am trying to extract the links from the following webpage: https://www.ine.es/dynt3/inebase/index.htm?padre=5608&capsel=5608#

For example, for Barcelona, I would expect to get:

2.9.1 Indicadores de renta media y mediana https://www.ine.es/jaxiT3/Tabla.htm?t=30896&L=0
2.9.2 Distribución por fuente de ingresos https://www.ine.es/jaxiT3/Tabla.htm?t=30897&L=0
2.9.3 Porcentaje de población con ingresos por unidad de consumo por debajo de determinados umbrales fijos por sexo https://www.ine.es/jaxiT3/Tabla.htm?t=30898&L=0
2.9.4 Porcentaje de población con ingresos por unidad de consumo por debajo de determinados umbrales fijos por sexo y tramos de edad https://www.ine.es/jaxiT3/Tabla.htm?t=30899&L=0
...
2.9.10 Indicadores demográficos https://www.ine.es/jaxiT3/Tabla.htm?t=30904&L=0

I want to do this for all provinces. When I run the following, I get NA.

library(rvest)
out <- read_html("https://www.ine.es/dynt3/inebase/index.htm?padre=5608&capsel=5608#")

out %>% 
  html_attr("href")
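A likely reason for the NA: html_attr() is being called on the whole document, so it looks for an href on the root <html> element instead of on the links. The sketch below reproduces this offline; the HTML string is a hypothetical stand-in for the page (the real markup is different), and it only shows that selecting the a elements first is what makes html_attr("href") return values.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page, so the example runs offline.
page <- read_html('
  <html><body><ul>
    <li><a href="https://www.ine.es/jaxiT3/Tabla.htm?t=30896&amp;L=0">2.9.1</a></li>
    <li><a href="https://www.ine.es/jaxiT3/Tabla.htm?t=30897&amp;L=0">2.9.2</a></li>
  </ul></body></html>')

# Called on the document itself, html_attr() inspects the root <html> element,
# which has no href attribute, so the result is NA.
root_href <- html_attr(page, "href")

# Selecting the <a> elements first yields one href per link.
links <- page %>% html_nodes("a")
link_df <- data.frame(
  title = html_text(links),
  href  = html_attr(links, "href"),
  stringsAsFactors = FALSE
)
```

Here link_df has one row per link, which is the shape asked for in the title.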

Edit:

The link is as follows:

https://www.ine.es/dynt3/inebase/index.htm?padre=5608

Two sections are not expanded. I can use the following:

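library(rvest)
library(stringr)  # needed for str_sub()

lnk <- "https://www.ine.es/dynt3/inebase/index.htm?padre=5608"
out <- read_html(lnk)

# Take the last four characters of each href (the section id)
# and append them to the base link as the capsel parameter
x <- out %>% 
  html_nodes('ol') %>% 
  html_nodes('li') %>% 
  html_nodes('a') %>% 
  html_attr('href') %>% 
  str_sub(-4, -1) %>% 
  paste(lnk, "&capsel=", ., sep = "")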

This gives me the following output:

"https://www.ine.es/dynt3/inebase/index.htm?padre=5608&capsel=5650" "https://www.ine.es/dynt3/inebase/index.htm?padre=5608&capsel=7132"
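Taking the last four characters only works while every section id has exactly four digits. As a minimal sketch, assuming the hrefs look like the hypothetical strings below (their real form is not shown in the post), the id could instead be pulled out with a regular expression:

```r
library(stringr)

# Hypothetical hrefs; the actual attribute values on the page may differ.
hrefs <- c("index.htm?padre=5608&capsel=5650",
           "index.htm?padre=5608&capsel=7132")
lnk <- "https://www.ine.es/dynt3/inebase/index.htm?padre=5608"

# Fixed-width approach used in the question.
ids_fixed <- str_sub(hrefs, -4, -1)

# Regex approach: capture the digits that follow "capsel=".
ids_regex <- str_match(hrefs, "capsel=([0-9]+)")[, 2]

urls <- paste0(lnk, "&capsel=", ids_regex)
```

Both approaches agree on these inputs; the regex version would keep working if an id had a different number of digits.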

Each of these links expands one of the two sections, and now I am trying to extract every link contained in those sections.

EDIT2

Doing the same as above to get the links of the expanded section, I run:

x[2] %>% 
  read_html() %>% 
  html_nodes('ol') %>% 
  html_nodes('li') %>% 
  html_nodes('a') %>% 
  html_attr('href') %>% 
  str_sub(-4, -1) %>% 
  paste(lnk, "&capsel=", ., sep = "")

This gives me:

 [1] "https://www.ine.es/dynt3/inebase/index.htm?padre=5608&capsel=5650" "https://www.ine.es/dynt3/inebase/index.htm?padre=5608&capsel=7132"
 [3] "https://www.ine.es/dynt3/inebase/index.htm?padre=5608&capsel=5609" "https://www.ine.es/dynt3/inebase/index.htm?padre=5608&capsel=5652"
 [5] "https://www.ine.es/dynt3/inebase/index.htm?padre=5608&capsel=5653" "https://www.ine.es/dynt3/inebase/index.htm?padre=5608&capsel=5654"

The first two results correspond to the links I already obtained in the previous step. I am interested in the links after the first 2.
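One way to keep only the new links, sketched with the URLs copied from the two outputs above (the variable names are illustrative), is to remove everything that already appeared in the first pass:

```r
# URLs rebuilt from the two outputs shown above.
base <- "https://www.ine.es/dynt3/inebase/index.htm?padre=5608&capsel="
first_pass  <- paste0(base, c("5650", "7132"))
second_pass <- paste0(base, c("5650", "7132", "5609", "5652", "5653", "5654"))

# Keep only the links that did not appear in the first pass.
new_links <- setdiff(second_pass, first_pass)
```

setdiff() does not rely on the repeated links being in the first two positions, unlike dropping them by index.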

Now I want to extract all the links from the expanded "Provincia" section, i.e. the white area in the image below:

Answer 1

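You can write a series of user-defined functions/helpers that extract the expand URL in each case and finally return the details you want as a data frame. You can then combine all the data frames into one with map2_dfr from purrr. I use map2_dfr because, when retrieving the details from each li list, you also need to change the index in the nth-child part of the CSS selector. This means the get_details function needs 2 arguments as input: the URL and the index.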

map2_dfr() is a variant of map() that iterates over two arguments in parallel and row-binds the results into a single data frame.
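To see the pattern in isolation, here is a small offline sketch. toy_details is a hypothetical stand-in for get_details and the example.com URLs are placeholders; the point is only that each (url, index) pair produces a data frame and map2_dfr stacks them. Note that map2_dfr needs the dplyr package installed for the row-binding step.

```r
library(purrr)

# Hypothetical stand-in for get_details(): takes a URL and an index
# and returns a one-row data frame.
toy_details <- function(url, n) {
  data.frame(url = url, position = n, stringsAsFactors = FALSE)
}

urls <- c("https://example.com/a", "https://example.com/b", "https://example.com/c")

# Calls toy_details(urls[1], 1), toy_details(urls[2], 2), ... and
# row-binds the resulting data frames.
combined <- map2_dfr(urls, seq_along(urls), toy_details)
```

combined ends up with one row per URL and the columns url and position.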

library(rvest)
library(tidyverse)
library(purrr)

get_expand_url <- function(url) {
  link <- read_html(url) %>%
    html_node(".inebase_capitulo:nth-child(2) .desplegar") %>%
    html_attr("href") %>%
    url_absolute(url)
  return(link)
}

get_provincias_links <- function(url) {
  provincias <- read_html(url) %>%
    html_nodes(".respuestas > .inebase_capitulo:nth-child(2) .inebase_capitulo [id^=c_]") %>%
    html_attr("href") %>%
    url_absolute(url)
  return(provincias)
}

get_details <- function(provincia_url, n) {
  # The n-th province block inside the expanded "Provincia" section
  node <- read_html(provincia_url) %>%
    html_node(sprintf(".respuestas > .inebase_capitulo:nth-child(2) .inebase_capitulo:nth-child(%i)", n))

  # Province name: the first text node after the <span>
  provincia <- node %>%
    html_node(xpath = ".//span/following-sibling::text()[1]") %>%
    html_text(trim = TRUE)

  df <- data.frame(
    index = node %>%
      html_nodes(".indice:nth-child(n+3)") %>%
      html_text(),

    title = node %>%
      html_nodes("span +.titulo") %>%
      html_text(),

    # Resolve relative hrefs against the page they came from,
    # rather than relying on a global variable
    link = node %>%
      html_nodes("span +.titulo") %>%
      html_attr("href") %>% url_absolute(provincia_url)
  )
  df$provincia <- provincia
  return(df)
}

start_url <- "https://www.ine.es/dynt3/inebase/index.htm?padre=5608"

expand_url <- get_expand_url(start_url)
provincias_links <- get_provincias_links(expand_url)
indices <- seq_along(provincias_links)  # safe even if no links were found

df <- purrr::map2_dfr(provincias_links, indices, .f = get_details)
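The index-dependent selector used in get_details can be tried offline. The sketch below uses a hypothetical HTML fragment and made-up class names (provinces, province), not the real INE markup; it only illustrates how sprintf() builds an nth-child selector for position n and how url_absolute() resolves the relative href against an assumed base URL.

```r
library(rvest)

# Hypothetical fragment: one <li> per province, each holding a relative link.
page <- read_html('
  <html><body><ul class="provinces">
    <li class="province">Albacete <a href="Tabla.htm?t=1">Table A</a></li>
    <li class="province">Barcelona <a href="Tabla.htm?t=2">Table B</a></li>
  </ul></body></html>')

pick_link <- function(n) {
  # Build a selector that targets the n-th province block only.
  selector <- sprintf(".province:nth-child(%i) a", n)
  page %>%
    html_node(selector) %>%
    html_attr("href") %>%
    xml2::url_absolute("https://www.ine.es/jaxiT3/")  # assumed base URL
}

second <- pick_link(2L)
```

Calling pick_link() with a different n moves the selector to a different list item, which is why the answer passes both the URL and the index to get_details.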
