Retrieve link from HTML table with rvest

Question

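I want to scrape a website that stores the results of German cycling races, but I am struggling to get the URLs that point to the individual race results. The page in question is the result table at the URL used in the code below.

This is what I have so far. The structure of the HTML tables also looks odd to me, but that may just be my lack of HTML knowledge: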

library(tidyverse)
library(magrittr)
library(rvest)

# read the results page
result_url <- "https://www.rad-net.de/rad-net-ergebnisse.htm?name=Ausschreibung&view=ascr_erg&rnswp_disziplin=1"
results <- read_html(result_url)

# extract date and race name from the 8th table on the page
results %>%
  html_table(header = TRUE, fill = TRUE) %>%
  extract2(8) %>%
  as_tibble()
#> # A tibble: 40 x 2
#>    Datum         Veranstaltungstitel                                            
#>    <chr>         <chr>                                                          
#>  1 So, 19.07.20… "5. Rosenheimer Jugend - Kriterium"                            
#>  2 So, 12.07.20… "Swiss O Par Preis"                                            
#>  3 So, 12.07.20… "Deutsche Meisterschaft Einzelzeitfahren U19m/w"               
#>  4 So, 12.07.20… "Jugendrenntag der RV Offenbach"                               
#>  5 Sa, 04.07.20… "CoronaChronoNRW"                                              
#>  6 Sa, 20.06.20… "GCA Klassiker PROS / KIDS / ALL powered by «Müller \u0096 Die…
#>  7 Sa, 13.06.20… "GCA Klassiker PROS / KIDS / ALL powered by «Müller \u0096 Die…
#>  8 Sa, 06.06.20… "GCA Klassiker PROS / KIDS / ALL powered by «Müller \u0096 Die…
#>  9 So, 31.05.20… "Westsachsenklassiker - 72. Sachsenringradrennen"              
#> 10 So, 08.03.20… "8. Herforder Frühjahrspreis"                                  
#> # … with 30 more rows

Created on 2020-07-25 by the reprex package (v0.3.0)

Answer 1

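I think you are looking for a bit more information than html_table usually provides: it returns only the text of each cell, so the href attributes of the links are dropped. (The page also contains several nested HTML tables.) I think this is what you are after: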

library(tidyverse)
library(magrittr)
library(rvest)

results <- paste0("https://www.rad-net.de/rad-net-ergebnisse.htm",
                  "?name=Ausschreibung&view=ascr_erg&rnswp_disziplin=1") %>%
             read_html()

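# all links that sit inside any table on the page
link_nodes <- results %>% html_nodes(xpath = "//table//a")
link_text  <- link_nodes %>% html_text()

# the race links lie between the "hier" link and the "Nächste" (next page) link
index <- (which(link_text == "hier") + 1):(which(link_text == "N\u00e4chste") - 1)
link_nodes <- link_nodes[index]

# note: an XPath starting with "//" is evaluated against the whole document,
# not relative to link_nodes, so this returns every matching date cell
dates <- link_nodes %>%
  html_nodes(xpath = "//table//a/parent::td/preceding-sibling::td/font") %>%
  html_text()

# dates[-1] drops the first match so the dates line up with the race links
df <- tibble(Datum = dates[-1],
             Veranstaltungstitel = link_nodes %>% html_text(),
             link = link_nodes %>% html_attr("href"))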

df
#> # A tibble: 40 x 3
#>    Datum     Veranstaltungstitel                   link                         
#>    <chr>     <chr>                                 <chr>                        
#>  1 So, 19.0~ "5. Rosenheimer Jugend - Kriterium"   /rad-net-portal/rad-net-erge~
#>  2 So, 12.0~ "Swiss O Par Preis"                   /rad-net-portal/rad-net-erge~
#>  3 So, 12.0~ "Deutsche Meisterschaft Einzelzeitfa~ /rad-net-portal/rad-net-erge~
#>  4 So, 12.0~ "Jugendrenntag der RV Offenbach"      /rad-net-portal/rad-net-erge~
#>  5 Sa, 04.0~ "CoronaChronoNRW"                     /rad-net-portal/rad-net-erge~
#>  6 Sa, 20.0~ "GCA Klassiker PROS / KIDS / ALL pow~ /rad-net-portal/rad-net-erge~
#>  7 Sa, 13.0~ "GCA Klassiker PROS / KIDS / ALL pow~ /rad-net-portal/rad-net-erge~
#>  8 Sa, 06.0~ "GCA Klassiker PROS / KIDS / ALL pow~ /rad-net-portal/rad-net-erge~
#>  9 So, 31.0~ "Westsachsenklassiker - 72. Sachsenr~ /rad-net-portal/rad-net-erge~
#> 10 So, 08.0~ "8. Herforder Frühjahrspreis"         /rad-net-portal/rad-net-erge~
#> # ... with 30 more rows

Created on 2020-07-25 by the reprex package (v0.3.0)
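
As a possible variation on the code above, here is a minimal sketch that pairs each link with the date in its own row by using an XPath relative to the link node, and then turns the relative href values into absolute URLs. It runs against a small inline HTML snippet rather than the live site: the snippet, its `/results/...` paths, and the `base_url` variable are made up for illustration, and the real page layout may well need a different XPath.

```r
library(rvest)

# Hypothetical stand-in for the results page: each row has a date cell
# followed by a title cell that contains the link.
page <- read_html(paste0(
  "<table>",
  "<tr><td><font>So, 19.07.2020</font></td>",
  "<td><a href='/results/1.htm'>Race A</a></td></tr>",
  "<tr><td><font>So, 12.07.2020</font></td>",
  "<td><a href='/results/2.htm'>Race B</a></td></tr>",
  "</table>"
))

# html_table() keeps only the cell text, so the href values are not in it
text_only <- html_table(page)[[1]]

# select the link nodes inside table cells
links <- html_nodes(page, xpath = "//table//tr/td/a")

# "./" makes the XPath relative to each link; html_node() returns exactly
# one match per link, so dates and links stay aligned
dates <- links %>%
  html_node(xpath = "./parent::td/preceding-sibling::td[1]") %>%
  html_text(trim = TRUE)

# assumed base URL, used only to resolve the relative links
base_url <- "https://www.rad-net.de/rad-net-ergebnisse.htm"

df <- data.frame(
  Datum = dates,
  Veranstaltungstitel = html_text(links),
  link = xml2::url_absolute(html_attr(links, "href"), base_url),
  stringsAsFactors = FALSE
)

df
```

Because the date is looked up from each link rather than from the whole document, there is no need to drop a stray first element as with `dates[-1]`, and the absolute URLs can be passed straight to `read_html()` to fetch the individual result pages.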

