Scrape multiple tables from Wikipedia in R

Question

I am trying to scrape the contents of this Wiki page using the rvest library in R:

(https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2019)

I want to extract the 4 tables that contain data on the Bollywood films released in 2019 (January to March, April to June, July to September, October to December).

What I have done so far:

library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2019"
webpage <- read_html(url)
tbls <- html_nodes(webpage, "table")

# Next I match on the word "Opening" and get the 4 tables shown on the Wikipedia page,
# but I am struggling to combine them into one data frame and store it

tbls[grep("Opening", tbls, ignore.case = TRUE)]

This gives an error:

df <- html_table(tbls[grep("Opening", tbls, ignore.case = TRUE)], fill = TRUE)

I realise that, because this returns multiple tables, I must be missing a subscript somewhere, but I have no idea where. Help!

Answer 1

For complex HTML tables, I recommend the htmltab package:

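library(purrr)
library(htmltab)

url <- "https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2019"
# Call htmltab(url, which) once for each of which = 4, 5, 6, 7 (the 4th to 7th tables on the page)
tbls <- map2(url, 4:7, htmltab)
# Stack the resulting data frames into a single data frame
tbls <- do.call(rbind, tbls)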
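
To make the last step concrete, here is a minimal base R sketch of what do.call(rbind, ...) does with a list of data frames. The data frames q1, q2 and q3 are made-up stand-ins for the scraped tables, not real data from the page, and I am assuming the scraped tables share the same column names.

```r
# Hypothetical stand-ins for two scraped quarterly tables (made-up rows)
q1 <- data.frame(Opening = "11 Jan", Title = "Film A",
                 stringsAsFactors = FALSE)
q2 <- data.frame(Opening = c("5 Apr", "12 Apr"),
                 Title = c("Film B", "Film C"),
                 stringsAsFactors = FALSE)

# do.call() passes every list element to rbind() as a separate argument,
# so this is equivalent to rbind(q1, q2)
combined <- do.call(rbind, list(q1, q2))
nrow(combined)  # 3

# rbind() needs matching columns; a table with an extra column makes it fail
q3 <- data.frame(Opening = "5 Jul", Title = "Film D", Genre = "Drama",
                 stringsAsFactors = FALSE)
result <- tryCatch(do.call(rbind, list(q1, q3)),
                   error = function(e) "rbind failed")
```

If the real tables turn out to have different columns, dplyr::bind_rows() is one alternative, since it fills columns that are missing from a table with NA instead of stopping.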

Answer 2

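Here is one way to do it, although I believe there are better ways to handle your case. When you use the rvest package, you can use SelectorGadget. You can see that there are 15 tables on the linked page. First, scrape all the tables and put them in a list. Then subset the list using the column information. The tables you want have Opening as a column name, so I used a logical check to test whether each list element has a column with that name, and that gave me the four tables you want.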

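library(tidyverse)
library(htmltab)

url <- "https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2019"

# Parse each of the 15 tables on the page into a list of data frames
res <- map(1:15, function(mynum) {
  htmltab(doc = url, which = mynum, rm_nodata_cols = FALSE)
})

# Keep only the tables that have a column named "Opening"
out <- Filter(function(x) "Opening" %in% names(x), res)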
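
The same filter-by-column-name idea can also be sketched with rvest alone, which is what the question started with. This is only an illustration under some assumptions: the HTML string below is a tiny made-up page (not the Wikipedia markup), the names page, all_tables, wanted and films are mine, and I am assuming html_table() is given the whole set of table nodes so that it returns a list with one element per table.

```r
library(rvest)

# Made-up page with three small tables; two have an "Opening" column
html <- '
<html><body>
<table>
  <tr><th>Opening</th><th>Title</th></tr>
  <tr><td>11 Jan</td><td>Film A</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Title</th></tr>
  <tr><td>1</td><td>Film A</td></tr>
</table>
<table>
  <tr><th>Opening</th><th>Title</th></tr>
  <tr><td>5 Apr</td><td>Film B</td></tr>
  <tr><td>12 Apr</td><td>Film C</td></tr>
</table>
</body></html>'

page <- read_html(html)

# html_table() on a set of nodes returns a list, one table per element
all_tables <- html_table(html_nodes(page, "table"))

# Keep the tables that have an "Opening" column, then stack them
wanted <- Filter(function(x) "Opening" %in% names(x), all_tables)
films <- do.call(rbind, wanted)
```

Because html_table() returns a list here, the result has to be filtered and combined before it can be treated as one data frame, which may be the missing step the question describes.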
