Table from url using rvest package

Question

I want to pull the information from a website into a table. I tried the code below, but the resulting table is in a confusing format.

url <- 'http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9'
url %>% html_table(fill = TRUE)

Note: tidyverse and rvest are already loaded.

Answer 1

You need to clean the table up a bit.

library(rvest)
library(dplyr)

url <- 'http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9'

url %>% 
  read_html %>% 
  html_table(fill = TRUE) %>%
  .[[1]] %>%
  .[complete.cases(.),] %>%
  mutate_all(~gsub('\n|\\s{2,}', '', .))  # drop newlines and runs of 2+ whitespace characters

#   W/L                 Fighter Str Td Sub Pass
#1 loss Tom AaronMatt Ricehouse  00 00  00   00
#2  win Tom AaronEric Steenberg  00 00  00   00

#                                            Event              Method Round Time
#1 Strikeforce - Henderson vs. BabaluDec. 04, 2010               U-DEC     3 5:00
#2      Strikeforce - Heavy ArtilleryMay. 15, 2010 SUBGuillotine Choke     1 0:56
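To see what the `gsub` pattern does to a single cell, here is a minimal base-R sketch. The string `cell` is a hard-coded sample of the raw `Fighter` value that `html_table` returns for this page; it is an assumption that the other cells follow the same shape.

```r
# Sample raw cell: two names separated by a run of newlines and spaces
# (hard-coded for illustration, not fetched from the site).
cell <- "Tom Aaron\n          \n        \n\n        \n          \n            Matt Ricehouse"

# Remove newlines and runs of 2+ whitespace characters; single spaces survive.
# Note the doubled backslash: "\\s" in an R string is the regex \s.
cleaned <- gsub("\n|\\s{2,}", "", cell)
print(cleaned)
```

The separator is deleted rather than replaced, which is why the two names end up fused as "Tom AaronMatt Ricehouse" in the output above.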

Answer 2

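The table you're working with is tricky because it has table cells (`<td>` elements in the HTML) that span two rows in order to repeat information. When html_table strips out the markup, those separate rows get joined together and you end up with a long string of spaces and newlines.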

library(dplyr)
library(rvest)

ufc <- read_html("http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9") %>%
  html_table(fill = TRUE) %>%
  .[[1]] %>%
  filter(!is.na(Fighter)) # could instead use janitor::remove_empty or rowSums for number of NAs

ufc$Fighter[1]
#> [1] "Tom Aaron\n          \n        \n\n        \n          \n            Matt Ricehouse"

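With a little regex, you can use that whitespace as a delimiter to split the cells. Information that applies to both rows (such as the time) gets repeated. Initially I did this with mutate_all, but then realized that Event should not be split; for that column, just remove the extra whitespace. Adjust the other columns as needed.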

ufc %>%
  mutate_at(vars(Fighter:Pass), stringr::str_replace_all, "\\s{2,}", "|") %>%  # mark the split points
  mutate_all(stringr::str_replace_all, "\\s{2,}", " ") %>%                     # collapse remaining whitespace
  tidyr::separate_rows(everything(), sep = "\\|")                              # one row per fighter
#>    W/L        Fighter Str Td Sub Pass
#> 1 loss      Tom Aaron   0  0   0    0
#> 2 loss Matt Ricehouse   0  0   0    0
#> 3  win      Tom Aaron   0  0   0    0
#> 4  win Eric Steenberg   0  0   0    0
#>                                              Event               Method Round
#> 1 Strikeforce - Henderson vs. Babalu Dec. 04, 2010                U-DEC     3
#> 2 Strikeforce - Henderson vs. Babalu Dec. 04, 2010                U-DEC     3
#> 3      Strikeforce - Heavy Artillery May. 15, 2010 SUB Guillotine Choke     1
#> 4      Strikeforce - Heavy Artillery May. 15, 2010 SUB Guillotine Choke     1
#>   Time
#> 1 5:00
#> 2 5:00
#> 3 0:56
#> 4 0:56
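The splitting idea can be tried on a single cell without downloading the page. This is a base-R sketch, not the dplyr pipeline above: `split_cell` is a hypothetical helper name, and the sample string is hard-coded.

```r
# Hypothetical helper: split one scraped cell on runs of 2+ whitespace
# characters, which is where html_table() joined the two original lines.
split_cell <- function(x) {
  parts <- strsplit(x, "\\s{2,}")[[1]]
  trimws(parts)
}

# Sample raw cell (hard-coded for illustration, not fetched from the site).
cell <- "Tom Aaron\n          \n        \n\n        \n          \n            Matt Ricehouse"
fighters <- split_cell(cell)
print(fighters)

# A cell with no long whitespace run comes back as a single piece.
print(split_cell("U-DEC"))
```

Single spaces inside a name are left alone because the pattern only matches two or more whitespace characters in a row.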
