Web Scraping with Rvest, Lapply, Rbind, and Selector Gadget

Question

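I am new to web scraping and am having trouble scraping multiple pages of a website. Any feedback from the community is appreciated!

Objective: Scrape the statistics for each team, starting with 2019, from cfbstats.com and put each team's data into one data frame or into customized per-team data frames.

Action: In the code below, I build the page URLs by pasting each team's unique identifier (a small sample) into the URL string. I then use lapply and rvest to pull each team's data. Finally, rbind and do.call are used to combine all of the pulls.

Problem: With the help of @Dave2e, I realized that rbind fails when it tries to combine the scraped data tables. Any suggestions for joining data tables from a web scrape? The error is:

Error in match.names(clabs, names(xi)) : names do not match previous names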

library(tidyverse)
library(rvest)

team_id <- c(721, 5, 8)
teams <- paste('http://cfbstats.com/2019/team/', team_id, '/index.html', sep = "")

df_team_stats <- lapply(teams, function(i){
  webpage <- read_html(i)
  team_table <- html_nodes(webpage, '.team-statistics')
  overall_stats <- html_table(team_table)[[1]]
})

finaldf <- do.call(rbind, df_team_stats)
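
The error can be reproduced without touching the network, which makes it easier to see what rbind is objecting to. The sketch below is only an illustration: `make_table()`, the team labels "Team A" and "Team B", and all the numbers are invented stand-ins for the scraped tables, under the assumption that each real table uses the team's name as a column header.

```r
# Hypothetical helper that mimics the shape of one scraped table:
# a stat label, a column named after the team, and an "Opponents" column.
make_table <- function(team, team_vals, opp_vals) {
  tbl <- data.frame(
    stat = c("Scoring: Points/Game", "Passing: Yards"),
    team_col = team_vals,
    Opponents = opp_vals,
    stringsAsFactors = FALSE
  )
  names(tbl)[2] <- team  # the team name becomes the column header
  tbl
}

# Invented values, for illustration only.
tables <- list(
  make_table("Team A", c("34.1", "1602"), c("19.8", "2848")),
  make_table("Team B", c("21.0", "2500"), c("28.5", "2100"))
)

# rbind matches data frame columns by name, so differing headers raise an error.
result <- tryCatch(
  do.call(rbind, tables),
  error = function(e) conditionMessage(e)
)
print(result)
```

Because the second column is called "Team A" in one table and "Team B" in the other, rbind cannot line the columns up and stops with the message shown in the question.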

Answer 1

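There are a couple of problems here. First, html_table() applied to a node set returns a list of tables, so unless you pull out the first element with [[1]], your lapply returns not a list of data frames but a list of lists (each sublist having a single data frame as its only content).

Second, when you bind data frames together by row, they need to have the same column names. In your case, the team's name acts as a column name. If you simply overwrite it, you won't know which team a row refers to, so you need to add a "team" column and standardize the names of the other columns: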

library(dplyr)
library(rvest)

team_id <- c(721, 5, 8)
teams   <- paste('http://cfbstats.com/2019/team/', team_id, '/index.html', sep = "")

df_team_stats <- lapply(teams, function(i){
  webpage   <- read_html(i)
  all_stats <- html_nodes(webpage, xpath = "//table[@class='team-statistics']") %>%
               html_table() %>% `[[`(1)
  all_stats$team <- rep(names(all_stats)[2], nrow(all_stats))
  names(all_stats)[1:2] <- c("stat", "home")
  all_stats[,c(4, 1:3)]
})

as_tibble(bind_rows(df_team_stats))
#> # A tibble: 99 x 4
#>    team     stat                                   home          Opponents      
#>    <chr>    <chr>                                  <chr>         <chr>          
#>  1 Air For~ Scoring:  Points/Game                  34.1          19.8           
#>  2 Air For~ Scoring:  Games - Points               13 - 443      13 - 258       
#>  3 Air For~ First Downs:  Total                    286           216            
#>  4 Air For~ First Downs:  Rushing - Passing - By ~ 227 - 52 - 7  77 - 131 - 8   
#>  5 Air For~ Rushing:  Yards / Attempt              5.14          3.49           
#>  6 Air For~ Rushing:  Attempts - Yards - TD        755 - 3881 -~ 375 - 1307 - 11
#>  7 Air For~ Passing:  Rating                       187.92        141.25         
#>  8 Air For~ Passing:  Yards                        1602          2848           
#>  9 Air For~ Passing:  Attempts - Completions - In~ 126 - 68 - 6~ 377 - 238 - 7 ~
#> 10 Air For~ Total Offense:  Yards / Play           6.22          5.53           
#> # ... with 89 more rows

Created on 2020-06-27 by the reprex package (v0.3.0)
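
The same idea can be tried offline in base R. This is a minimal sketch rather than the answer's exact code: `make_table()` and `standardize()` are hypothetical helpers, and the team labels and numbers are invented. It assumes, as the answer does, that the second column of each table is headed by the team's name.

```r
# Hypothetical helper that mimics the shape of one scraped table.
make_table <- function(team, team_vals, opp_vals) {
  tbl <- data.frame(
    stat = c("Scoring: Points/Game", "Passing: Yards"),
    team_col = team_vals,
    Opponents = opp_vals,
    stringsAsFactors = FALSE
  )
  names(tbl)[2] <- team  # the team name becomes the column header
  tbl
}

# Invented values, for illustration only.
tables <- list(
  make_table("Team A", c("34.1", "1602"), c("19.8", "2848")),
  make_table("Team B", c("21.0", "2500"), c("28.5", "2100"))
)

# Move the team name out of the header into its own column,
# then give every table the same column names.
standardize <- function(tbl) {
  tbl$team <- names(tbl)[2]            # recycled to every row
  names(tbl)[1:2] <- c("stat", "home")
  tbl[, c("team", "stat", "home", "Opponents")]
}

# With identical names in every table, plain rbind succeeds.
combined <- do.call(rbind, lapply(tables, standardize))
print(combined)
```

Once every table shares the same column names, either do.call(rbind, ...) or dplyr's bind_rows() can stack them; the team column keeps track of where each row came from.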

Answer 2

Perhaps dplyr's bind_cols() is useful here. So something like:

library(tidyverse)
library(janitor)
library(rvest)

team_id <- c(721, 5, 8)
teams <- paste('http://cfbstats.com/2019/team/', team_id, '/index.html', sep = "")


df_team_stats <- lapply(teams, function(i){
  webpage <- read_html(i)
  team_table <- html_nodes(webpage, 'table.team-statistics')
  overall_stats <- html_table(team_table)[[1]] %>% 
    clean_names() %>% 
    column_to_rownames(var = "x")
})

finaldf <- df_team_stats %>% 
  bind_cols() %>% 
  rownames_to_column(var = "Statistics")
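
One thing to keep in mind with a wide layout is that bind_cols() pairs rows by position, so it relies on every team page listing the same statistics in the same order. If that cannot be guaranteed, joining on the statistic's name is a possible alternative. The base R sketch below illustrates this under stated assumptions: `make_table()` is a hypothetical helper, the column names `team_a` and `team_b` imitate what clean_names() might produce, and the numbers are invented.

```r
# Hypothetical helper that mimics one cleaned table:
# a stat label plus one column for the team and one for its opponents.
make_table <- function(team, stats, team_vals, opp_vals) {
  tbl <- data.frame(
    stat = stats,
    team_col = team_vals,
    opp_col = opp_vals,
    stringsAsFactors = FALSE
  )
  names(tbl)[2:3] <- c(team, paste0(team, "_opponents"))
  tbl
}

# Invented values; the second table lists its rows in a different
# order on purpose, to show that matching is done by stat, not position.
tables <- list(
  make_table("team_a",
             c("Scoring: Points/Game", "Passing: Yards"),
             c("34.1", "1602"), c("19.8", "2848")),
  make_table("team_b",
             c("Passing: Yards", "Scoring: Points/Game"),
             c("2500", "21.0"), c("2100", "28.5"))
)

# Merge the tables pairwise on the shared "stat" column.
wide <- Reduce(function(x, y) merge(x, y, by = "stat"), tables)
print(wide)
```

The result has one row per statistic and one pair of columns per team, similar in shape to the bind_cols() output, but rows are aligned by the stat label instead of by their order on the page.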
