Webscrape heading and lists to a dataframe with rvest
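I'd like to scrape the hyperlinks on this webpage into a dataframe with the columns shown below. The source page contains headings, each followed by a list of links.
- subject.heading (problem)
- hyperlink.title (ok)
- hyperlink (ok)
Getting the links and titles is straightforward (html_nodes() with "li" and "a"). What I'm not clear on is how to merge the subject headings into the final dataframe.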
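library(tidyverse)
library(rvest)

# Parse the page and keep only the main content container
my.url <- read_html("http://www.secnav.navy.mil/fmc/fmb/Pages/Fiscal-Year-2019.aspx") %>%
  html_nodes("#sharePointMainContent")

# Visible text of every list item
hyperlink.title <- my.url %>%
  html_nodes("li") %>%
  html_text()

# Target of every link inside a list item
hyperlink <- my.url %>%
  html_nodes("li") %>%
  html_nodes("a") %>%
  html_attr("href")

# Combine the two vectors defined above (the original call referenced an undefined `title`)
df <- tibble(hyperlink.title, hyperlink)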
I can scrape the headings successfully, but I can't figure out how to merge them correctly into the final dataframe.
subject.heading <- my.url %>%
  html_nodes("h3") %>%
  html_text() %>%
  str_trim()
Created on 2018-09-03 by the reprex package (v0.2.0).
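To make the problem concrete, here is a minimal sketch on a made-up HTML fragment. The markup and the names `toy`, `headings`, and `titles` are illustrative assumptions, not the real page. Scraping the headings and the list items separately gives two vectors of different lengths, with nothing recording which heading each item belongs to.

```r
library(rvest)

# Hypothetical stand-in for the page: two sections, each with a heading and a list
toy <- read_html('
  <div id="main">
    <div class="cell"><h3>SECTION A</h3><ul>
      <li><a href="a1.pdf">A one</a></li>
      <li><a href="a2.pdf">A two</a></li>
    </ul></div>
    <div class="cell"><h3>SECTION B</h3><ul>
      <li><a href="b1.pdf">B one</a></li>
    </ul></div>
  </div>')

# Scraped independently, the two vectors lose their relationship
headings <- toy %>% html_nodes("h3") %>% html_text()
titles   <- toy %>% html_nodes("li") %>% html_text()

length(headings)  # 2
length(titles)    # 3
```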
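The page is structured oddly: tables are nested inside a main table.
What I found to work is iterating (map_df()) over the cells of the parent table (identified by the s4-wpcell-plain class). Each cell contains another table, but rather than relying on html_table() we can simply extract what we want from each cell.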
library(tidyverse)
library(rvest)
#> Loading required package: xml2

r <- read_html("http://www.secnav.navy.mil/fmc/fmb/Pages/Fiscal-Year-2019.aspx") %>%
  html_node("#sharePointMainContent>div>table") %>%
  html_nodes(".s4-wpcell-plain") %>%
  # One data frame per cell, row-bound into a single result
  map_df(~{
    heading <- .x %>% html_nodes('h3') %>% html_text() %>% str_trim()
    titles <- .x %>% html_nodes('li') %>% html_text()
    links <- .x %>% html_nodes('a') %>% html_attr("href")
    # tibble() replaces the deprecated data_frame()
    tibble(heading, titles, links)
  })
r
#> # A tibble: 21 x 3
#>    heading                        titles                 links
#>    <chr>                          <chr>                  <chr>
#>  1 DEPARTMENT OF THE NAVY SUMMARY FY 19 DON Press Brief  http://www.secna…
#>  2 DEPARTMENT OF THE NAVY SUMMARY Supporting Exhibits    http://www.secna…
#>  3 DEPARTMENT OF THE NAVY SUMMARY Budget Highlights Book http://www.secna…
#>  4 DEPARTMENT OF THE NAVY SUMMARY The Bottom Line        http://www.secna…
#>  5 DEPARTMENT OF THE NAVY SUMMARY Report to Congress on… http://www.secna…
#>  6 DEPARTMENT OF THE NAVY SUMMARY Ship Building Plan SE… http://www.secna…
#>  7 MILITARY PERSONNEL PROGRAMS    Military Personnel, N… http://www.secna…
#>  8 MILITARY PERSONNEL PROGRAMS    Military Personnel, M… http://www.secna…
#>  9 MILITARY PERSONNEL PROGRAMS    Reserve Personnel, Na… http://www.secna…
#> 10 MILITARY PERSONNEL PROGRAMS    Reserve Personnel, Ma… http://www.secna…
#> # ... with 11 more rows
Created on 2018-09-04 by the reprex package (v0.2.0).
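Since the live page may change or disappear, here is a minimal offline sketch of the same per-cell idea on a made-up HTML fragment. The markup, the `.cell` class, and the names `toy` and `result` are illustrative assumptions. It also swaps map_df() for base lapply() plus dplyr::bind_rows(), purely to keep the dependencies small. The point it shows is that each cell yields a single heading, which tibble() recycles across that cell's rows.

```r
library(rvest)
library(tibble)

# Hypothetical stand-in for the page: each ".cell" holds one heading and one list
toy <- read_html('
  <div id="main">
    <div class="cell"><h3> SECTION A </h3><ul>
      <li><a href="a1.pdf">A one</a></li>
      <li><a href="a2.pdf">A two</a></li>
    </ul></div>
    <div class="cell"><h3> SECTION B </h3><ul>
      <li><a href="b1.pdf">B one</a></li>
    </ul></div>
  </div>')

cells <- html_nodes(toy, ".cell")

# Build one tibble per cell; the length-1 heading is recycled to match the list items
per_cell <- lapply(cells, function(cell) {
  tibble(
    heading = cell %>% html_nodes("h3") %>% html_text() %>% trimws(),
    titles  = cell %>% html_nodes("li") %>% html_text(),
    links   = cell %>% html_nodes("a") %>% html_attr("href")
  )
})

# Stack the per-cell tibbles into a single data frame
result <- dplyr::bind_rows(per_cell)
result
```

This relies on every cell having exactly one h3 and exactly one link per list item. A cell that breaks either assumption would give columns of incompatible lengths, so it is worth checking the counts per cell before trusting the result.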