Using rvest to scrape a specific HTML table using the heading name

Question

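I am trying to scrape data from a specific table of building permit information. The following code works for most of the building permits I am looping through: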

library(rvest)

permit_numbers <- c("BP125602", "BP125473", "BP125472")

URL <- paste("https://www.nanaimo.ca/WhatsBuilding/Folder", permit_numbers, sep = "/")

task_table <- lapply(URL, function(x) {
    x %>%
    read_html() %>%
    html_table() %>%
    .[[3]] %>%
    .[["Task"]]
})

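However, sometimes the task information is not in the third table on the page. For example, at https://www.nanaimo.ca/WhatsBuilding/Folder/BP125721 the task information is in the second table.

How can I scrape the information under the column heading "Task", regardless of where it sits on the page?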

Answer 1

This should work:

library(rvest)
URL <- "https://www.nanaimo.ca/WhatsBuilding/Folder/BP125602"

tables <- URL %>%
  read_html() %>%
  html_table()


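# Keep only the tables that have a "Task" column. %in% returns a single TRUE/FALSE,
# whereas names(x) == "Task" returns one value per column and breaks the if().
task_table <- lapply(tables, function(x) if ("Task" %in% names(x)) x)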

task_table[sapply(task_table, is.null)] <- NULL
task_table <- task_table[[1]][["Task"]]
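
Since the question loops over several permits, the same idea can be wrapped in a small function and applied to each page. The sketch below is only an illustration: `get_task` is a hypothetical helper name, and the two inline HTML snippets are made-up stand-ins for the real permit pages so the example runs without network access. It assumes each table's header row is marked up with `<th>` cells, so that `html_table()` picks up the column names.

```r
library(rvest)

# Hypothetical helper: return the requested column from the first table
# that has it, or NULL when no table on the page has such a column.
get_task <- function(page, column = "Task") {
  tables <- html_table(page)
  matches <- Filter(function(tbl) column %in% names(tbl), tables)
  if (length(matches) == 0) return(NULL)
  matches[[1]][[column]]
}

# Made-up pages: the "Task" table is second on one page and first on the other.
page_a <- read_html('
  <html><body>
  <table><tr><th>Owner</th></tr><tr><td>Smith</td></tr></table>
  <table><tr><th>Task</th><th>Status</th></tr>
         <tr><td>Framing</td><td>Passed</td></tr></table>
  </body></html>')
page_b <- read_html('
  <html><body>
  <table><tr><th>Task</th><th>Status</th></tr>
         <tr><td>Plumbing</td><td>Pending</td></tr></table>
  <table><tr><th>Owner</th></tr><tr><td>Jones</td></tr></table>
  </body></html>')

# One list element per page, regardless of where the table sits.
task_list <- lapply(list(page_a, page_b), get_task)
```

With the real URLs from the question, the call would presumably look like `lapply(URL, function(u) get_task(read_html(u)))`. Returning NULL for pages without a "Task" column keeps the loop from failing on a single odd page.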

Is this what you were looking for?

