Scrape and section data frame based on <strong> tags from rvest

Question

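I used rvest to scrape a block of text from a website. However, the text is divided into sections that are not defined by heading tags in the HTML. Instead, the section titles are only marked with <strong> tags.

For example, the tag structure looks like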

<div class="field-docs-content">
<p><strong>Title 1</strong></p>
<p> some sentences, some lines</p>
<p> some sentences, some lines</p>
<p> some sentences, some lines</p>
<p><strong>Another Title 2</strong></p>
<p> some sentences, some lines</p>
<p> some sentences, some lines</p>
<p> some sentences, some lines</p>
</div>

If I just scrape by 'field-docs-content' in rvest, I get a string like this

Title 1 some sentences, some lines some sentences, some lines some sentences, some lines Another Title 2 some sentences, some lines some sentences, some lines some sentences, some lines

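If I convert that to a data frame, it returns a single cell containing all of this text.

What I want is a data frame with 2 cells, so that the string above is broken at the titles marked with <strong> tags, like this: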

Title 1 some sentences, some lines some sentences, some lines some sentences, some lines 
Another Title 2 some sentences, some lines some sentences, some lines some sentences, some lines

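To be direct, what I am looking for is:

A data frame whose cells are broken up by the <strong> tag at the start of the string
All p tags under those strong-tag "titles" concatenated together rather than split apart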

My current scraping code looks like

library(rvest)

webpage <- read_html(url)  # url: address of the page to scrape (an example is given below)
data_html <- html_nodes(webpage, '.field-docs-content')
data <- html_text(data_html)
head(data)

I could replace '.field-docs-content' with 'strong', but that would not break up the sentences in the p tags beneath each title.

A good example URL in the wild is: https://www.presidency.ucsb.edu/documents/2016-democratic-party-platform

Thanks!

Answer 1

Here is one way to isolate the bold section headers using html_nodes:

full <- data_html %>% html_nodes("p") %>% html_text()

headers <- data_html %>% html_nodes("strong") %>% html_text()

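Then it is just a matter of organizing the text into the structure you want. The way you describe it sounds like a vector, which you can put into a data frame if you like. Here is one way to create a vector whose elements are broken up by the bold headers: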

ids <- which(full %in% headers) # starting position of each section

ids2 <- ids + c(diff(ids), length(full) - tail(ids, 1) + 1) - 1 # ending position of each section

vec <- character(length(ids)) # empty character vector for the pasted sections
for (i in seq_along(ids)) {
  vals <- ids[i]:ids2[i]
  vec[i] <- paste(full[vals], collapse = " ")
}
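
If you would rather avoid computing start and end positions by hand, the same grouping can be sketched with cumsum() and split() in base R. This is only an illustration on made-up stand-ins for full and headers, not the scraped page; section_id and sections are names I introduced here.

```r
# Toy stand-ins for the scraped vectors (made-up values, not real page content)
full    <- c("Title 1", "line a", "line b", "Another Title 2", "line c", "line d")
headers <- c("Title 1", "Another Title 2")

# Every header paragraph starts a new section, so a running count of the
# header flags gives each paragraph the id of the section it belongs to
section_id <- cumsum(full %in% headers)

# Paste the paragraphs of each section into a single string
vec <- unname(vapply(split(full, section_id), paste, character(1), collapse = " "))

# One row per section
sections <- data.frame(text = vec, stringsAsFactors = FALSE)
```

Note that any paragraphs before the first header would get id 0 and end up in a group of their own, so you may want to drop that group if it appears.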

Answer 2

One approach is to treat it like any other problem you might solve with the tidyverse:

  
library(rvest)
#> Loading required package: xml2
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

url <- "https://www.presidency.ucsb.edu/documents/2016-democratic-party-platform"
webpage <- read_html(url)

headers <- 
webpage %>% 
  html_nodes(".field-docs-content strong") %>% 
  html_text()

body <- webpage %>% 
  html_nodes(".field-docs-content p") %>% 
  html_text() %>% 
  tibble(body_text = .)

body %>%
  mutate(
    headers = case_when(body_text %in% headers ~ body_text)
    ) %>% 
  tidyr::fill(headers) %>% 
  filter(headers != body_text) %>% 
  group_by(headers) %>% 
  summarise(body_text = paste(body_text, collapse = " "))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 14 x 2
#>    headers                                body_text                             
#>    <chr>                                  <chr>                                 
#>  1 A Leader in the World                  "American leadership is essential to …
#>  2 Bring Americans Together and Remove B… "Democrats believe that everyone dese…
#>  3 Combat Climate Change, Build a Clean … "Climate change is an urgent threat a…
#>  4 Confront Global Threats                "Democrats will protect our country. …
#>  5 Create Good-Paying Jobs                "Democrats know that nothing is more …
#>  6 Ensure the Health and Safety of All A… "Democrats have been fighting to secu…
#>  7 Fight for Economic Fairness and Again… "Democrats believe that today's extre…
#>  8 Preamble                               "In 2016, Democrats meet in Philadelp…
#>  9 Principled Leadership                  "Democrats believe that America must …
#> 10 Protect Our Values                     "Our values of inclusion and toleranc…
#> 11 Protect Voting Rights, Fix Our Campai… "Democrats know that Americans' right…
#> 12 Provide Quality and Affordable Educat… "Democrats know that every child, no …
#> 13 Raise Incomes and Restore Economic Se… "Democrats believe we must break down…
#> 14 Support Our Troops and Keep Faith wit… "Democrats believe America must conti…

Created on 2020-07-21 by the reprex package (v0.3.0)
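
One thing to be aware of: because group_by() followed by summarise() sorts the result by the grouping column, the sections above come out in alphabetical order ("Preamble" is row 8) rather than in document order. If the original order matters, one option is to turn the header column into a factor whose levels follow first appearance. A minimal sketch on made-up data (the header and body values below are invented for illustration):

```r
library(dplyr)
library(tidyr)

# Toy paragraphs in document order; "Zebra" comes first on purpose
body    <- tibble(body_text = c("Zebra", "z1", "z2", "Apple", "a1"))
headers <- c("Zebra", "Apple")

result <- body %>%
  mutate(headers = case_when(body_text %in% headers ~ body_text)) %>%
  fill(headers) %>%                       # carry each header down to its paragraphs
  filter(headers != body_text) %>%        # drop the header rows themselves
  mutate(headers = factor(headers, levels = unique(headers))) %>%  # keep document order
  group_by(headers) %>%
  summarise(body_text = paste(body_text, collapse = " "), .groups = "drop")
```

Here result keeps "Zebra" before "Apple", since factor groups are ordered by level rather than alphabetically.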

Answer 3

Here is a solution that uses xpath syntax to pick out the right elements, and mapply to put them in a tibble:

library(rvest)

url   <- "https://www.presidency.ucsb.edu/documents/2016-democratic-party-platform"

page  <-  read_html(url) 

heads <-  page %>%
          html_nodes(xpath = "//p/strong/parent::p") %>% 
          html_text()

all_p <-  page %>%
          html_nodes(xpath = "//p") %>% 
          html_text()

start <-  match(heads, all_p)               # index of each header paragraph
end   <-  c(start[-1] - 1, length(all_p))   # last paragraph before the next header

result <- tibble::as_tibble(do.call(rbind, mapply(function(a, b, h) 
          {
            data.frame(header = h, body = paste(all_p[(a + 1):b], collapse = "\n"))
          }, a = start, b = end, h = heads, SIMPLIFY = FALSE)))

This gives you:

result
#> # A tibble: 15 x 2
#>    header                                     body                                    
#>    <chr>                                      <chr>                                   
#>  1 Preamble                                   "In 2016, Democrats meet in Philadelphi~
#>  2 Raise Incomes and Restore Economic Securi~ "Democrats believe we must break down a~
#>  3 Create Good-Paying Jobs                    "Democrats know that nothing is more im~
#>  4 Fight for Economic Fairness and Against I~ "Democrats believe that today's extreme~
#>  5 Bring Americans Together and Remove Barri~ "Democrats believe that everyone deserv~
#>  6 Protect Voting Rights, Fix Our Campaign F~ "Democrats know that Americans' right t~
#>  7 Combat Climate Change, Build a Clean Ener~ "Climate change is an urgent threat and~
#>  8 Provide Quality and Affordable Education   "Democrats know that every child, no ma~
#>  9 Ensure the Health and Safety of All Ameri~ "Democrats have been fighting to secure~
#> 10 Principled Leadership                      "Democrats believe that America must le~
#> 11 Support Our Troops and Keep Faith with Ou~ "Democrats believe America must continu~
#> 12 Confront Global Threats                    "Democrats will protect our country. We~
#> 13 Protect Our Values                         "Our values of inclusion and tolerance ~
#> 14 A Leader in the World                      "American leadership is essential to ke~
#> 15 The American Presidency ProjectJohn Wooll~ "Twitter Facebook\nCopyright © The Amer~
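
Row 15 looks like page footer text rather than part of the document, which is what you would expect since "//p" matches every paragraph on the page. One way to avoid that, assuming the content sits in the 'field-docs-content' div from the question, is to scope both xpath queries to that div. A minimal sketch on an inline toy page (the HTML below, including the footer div, is made up for illustration):

```r
library(rvest)

# Inline toy page: a content div plus a footer that also uses <p><strong>
page <- read_html('
<div class="field-docs-content">
<p><strong>Title 1</strong></p>
<p>first body line</p>
<p><strong>Another Title 2</strong></p>
<p>second body line</p>
</div>
<div class="footer"><p><strong>Site Name</strong></p><p>Copyright notice</p></div>')

# Restrict both queries to paragraphs inside the content div
scope <- "//div[contains(@class, 'field-docs-content')]"

heads <- page %>%
  html_nodes(xpath = paste0(scope, "//p[strong]")) %>%  # paragraphs containing <strong>
  html_text()

all_p <- page %>%
  html_nodes(xpath = paste0(scope, "//p")) %>%          # all paragraphs in the div
  html_text()
```

With heads and all_p built this way, the rest of the code above can stay the same, and the footer paragraphs never enter the result.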

Tags: html, r, web-scraping, rvest