Scrape web page pagination with rvest. Pagination path does not appear in the structure

Question

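I need your help with a web scraping problem. I am trying to scrape news from a website, but I am having trouble scraping the total number of pages.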

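For example, on this page I want to scrape the last page number shown in the pagination (166). But the pagination path does not appear in the site structure: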

library(rvest)
library(stringr)  # for str_trim()

url <- 'https://www.burkina24.com/category/actualite-au-burkina-faso/politique/'

read_html(url) %>%
  html_nodes("#wrapper .nav-links > a") %>%
  html_attr("href") %>% 
  str_trim()


read_html(url) %>%
  html_nodes("#wrapper > #content > .site-content > .container > .row > div > div > div > nav > .nav-links > a") %>%
  html_attr("href") %>% 
  str_trim()

I have tried every node I could find, but nothing is returned. Thanks.

Answer 1

Since you already know the total is 166 pages, why do you need to scrape it? Just loop over 1:166:

library(rvest)

url <- 'https://www.burkina24.com/category/actualite-au-burkina-faso/politique/page/'

data <- 
  purrr::map_dfr(
    1:166,
    function(x) {
      # One <article> node per post on listing page x
      articles <- read_html(paste0(url, x)) %>%
        html_nodes(xpath = "//div[@class='posts-lists']/div/article")
      data.frame(
        id = articles %>% html_attr("id"),
        title = articles %>% html_nodes("h2") %>% html_text(),
        link = articles %>% html_nodes("h2 > a") %>% html_attr("href"),
        # ".//" keeps the search inside each article; a bare "//" searches the whole document
        author = articles %>% html_nodes(xpath = ".//a[@rel='author']") %>% html_text()
      )
    }
  )
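
One thing that can go wrong in a loop like this is that the columns passed to data.frame() end up with different lengths, for instance when one article has no author link. The sketch below shows one way to keep the fields aligned. It runs on a small made-up HTML snippet (the markup and the names `html` and `result` are assumptions for illustration, not the real page). It uses html_node() (singular), which returns one result per article and NA where nothing matches, together with an XPath that starts with ".//" so the search stays inside each article.

```r
library(rvest)

# Hypothetical two-article listing; the second article has no author link.
html <- read_html('
<div class="posts-lists"><div>
  <article id="post-1"><h2><a href="/a">First</a></h2><a rel="author">Ann</a></article>
  <article id="post-2"><h2><a href="/b">Second</a></h2></article>
</div></div>')

articles <- html %>%
  html_nodes(xpath = "//div[@class='posts-lists']/div/article")

# html_node() yields exactly one value per article (NA when missing),
# so every column has the same length as `articles`.
result <- data.frame(
  id     = articles %>% html_attr("id"),
  title  = articles %>% html_node("h2") %>% html_text(),
  link   = articles %>% html_node("h2 > a") %>% html_attr("href"),
  author = articles %>% html_node(xpath = ".//a[@rel='author']") %>% html_text()
)
```

Whether this is needed on the real site depends on its markup, which is not verified here.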

Answer 2

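The number is evident in the class .pages. Use the class of the ellipsis as the anchor for the preceding element and move to the desired node with an adjacent sibling combinator.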

library(rvest)
library(magrittr)

url <- 'https://www.burkina24.com/category/actualite-au-burkina-faso/politique/'

pages <- read_html(url) %>%
  html_node(".dots + .page-numbers") %>% html_text() %>% as.integer()
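
To see how the adjacent sibling combinator picks out the last page number, here is a small offline sketch. The markup is a made-up pagination block (an assumption loosely modelled on a typical WordPress nav, not copied from the site), and `html` is just an illustrative name.

```r
library(rvest)

# Hypothetical pagination markup: the ellipsis carries class "dots" and the
# last page link is the element immediately after it.
html <- read_html('<nav><div class="nav-links">
  <span class="page-numbers current">1</span>
  <a class="page-numbers" href="/page/2/">2</a>
  <span class="page-numbers dots">&hellip;</span>
  <a class="page-numbers" href="/page/166/">166</a>
  <a class="next page-numbers" href="/page/2/">Next</a>
</div></nav>')

# ".dots + .page-numbers" matches the .page-numbers element that directly
# follows the .dots element as its next sibling.
pages <- html %>%
  html_node(".dots + .page-numbers") %>%
  html_text() %>%
  as.integer()
```

With this snippet `pages` is 166. On a category with only a few pages there may be no ellipsis at all, in which case the selector matches nothing and the result is NA.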

Personally, I would consider looping until there is no node matching the class next, i.e. until html_node(".next") returns no match.
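
A minimal sketch of that loop, run against in-memory pages rather than the live site. `fake_pages`, `fetch_page` and `total_pages` are hypothetical names introduced for illustration; against the real site the fetcher would presumably call read_html() on the page URL instead.

```r
library(rvest)

# Hypothetical stand-in for the site: three pages, only the last lacks ".next".
fake_pages <- list(
  '<div class="nav-links"><a class="next page-numbers" href="#">Next</a></div>',
  '<div class="nav-links"><a class="next page-numbers" href="#">Next</a></div>',
  '<div class="nav-links"><span class="page-numbers current">3</span></div>'
)

# Hypothetical fetcher; swap in a real request when scraping the site.
fetch_page <- function(i) read_html(fake_pages[[i]])

i <- 1
repeat {
  page <- fetch_page(i)
  # ... extract the articles from `page` here ...
  # Stop once no element on the page carries the class "next".
  if (length(html_nodes(page, ".next")) == 0) break
  i <- i + 1
}
total_pages <- i
```

This avoids needing the page count up front, at the cost of discovering the end only after the last page has been fetched.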

An uglier alternative is:

pages <- read_html(url) %>% html_nodes(".page-numbers:not(.next)") %>% tail(.,1) %>% html_text() %>% as.integer()
