Scrape and section a data frame based on <strong> tags with rvest
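I used rvest to scrape a block of text from a website. However, the text is broken into sections that are not defined by heading tags in the HTML. Instead, the section titles are only marked with <strong> tags.
For example, the tag structure looks like this: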
<div class="field-docs-content">
<p><strong>Title 1</strong></p>
<p> some sentences, some lines</p>
<p> some sentences, some lines</p>
<p> some sentences, some lines</p>
<p><strong>Another Title 2</strong></p>
<p> some sentences, some lines</p>
<p> some sentences, some lines</p>
<p> some sentences, some lines</p>
</div>
If I scrape by 'field-docs-content' alone in rvest, I get a string like this:
Title 1 some sentences, some lines some sentences, some lines some sentences, some lines Another Title 2 some sentences, some lines some sentences, some lines some sentences, some lines
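If I convert that to a data frame, it returns a single cell containing all of the text.
What I want is a data frame with 2 cells, where the string above is split at the titles marked with <strong> tags, like this: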
Title 1 some sentences, some lines some sentences, some lines some sentences, some lines
Another Title 2 some sentences, some lines some sentences, some lines some sentences, some lines
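Specifically, what I am looking for is:
- a data frame whose cells each begin with a <strong> title
- all the p tags under each of those <strong> "titles" concatenated together rather than kept separate
My current scraping code looks like this: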
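library(rvest)

url <- "https://www.presidency.ucsb.edu/documents/2016-democratic-party-platform"
webpage <- read_html(url)
data_html <- html_nodes(webpage, '.field-docs-content')  # the container div
data <- html_text(data_html)  # all of its text, collapsed into one string
head(data)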
I could replace '.field-docs-content' with 'strong', but that would not split up the sentences in the p tags beneath each title.
A good real-world example URL is: https://www.presidency.ucsb.edu/documents/2016-democratic-party-platform
Thanks!
Here is one way to isolate the bold section headers using html_nodes:
full <- data_html %>% html_nodes("p") %>% html_text()
headers <- data_html %>% html_nodes("strong") %>% html_text()
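Then it is just a matter of organizing the text into the structure you want. The way you describe it sounds like a vector, which you can put into a data frame if you like. Here is one way to create a vector whose elements are split at the bold headers: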
ids <- which(full %in% headers)  # starting position of each section
ids2 <- ids + c(diff(ids), length(full) - tail(ids, 1) + 1) - 1  # ending position of each section
vec <- character(length(ids))  # empty vector for the destination values
for (i in seq_along(ids)) {
  vals <- ids[i]:ids2[i]  # the header plus every paragraph up to the next header
  vec[i] <- paste(full[vals], collapse = " ")
}
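As a possible shortcut for the loop above, the same grouping can be sketched in base R with cumsum and split. The toy full and headers vectors below are made-up stand-ins for the scraped values, and the sketch assumes the first paragraph is a header and that no body paragraph has exactly the same text as a header.

```r
# Toy stand-ins for the scraped vectors (assumed values, not real page content)
full <- c("Title 1", "some sentences", "some lines",
          "Another Title 2", "more sentences", "more lines")
headers <- c("Title 1", "Another Title 2")

# cumsum() over the header flags gives every paragraph its section number
section <- cumsum(full %in% headers)

# Paste each section (header first) into one string, one row per section
sections <- data.frame(
  text = unname(vapply(split(full, section), paste, character(1), collapse = " ")),
  stringsAsFactors = FALSE
)
sections
```

Each row of sections then holds one header followed by the paragraphs beneath it, which is the shape asked for in the question.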
One approach is to treat it like any other problem you might solve with the tidyverse:
library(rvest)
#> Loading required package: xml2
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
url <- "https://www.presidency.ucsb.edu/documents/2016-democratic-party-platform"
webpage <- read_html(url)
headers <-
  webpage %>%
  html_nodes(".field-docs-content strong") %>%
  html_text()
body <- webpage %>%
  html_nodes(".field-docs-content p") %>%
  html_text() %>%
  tibble(body_text = .)
body %>%
  mutate(
    headers = case_when(body_text %in% headers ~ body_text)
  ) %>%
  tidyr::fill(headers) %>%
  filter(headers != body_text) %>%
  group_by(headers) %>%
  summarise(body_text = paste(body_text, collapse = " "))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 14 x 2
#> headers body_text
#> <chr> <chr>
#> 1 A Leader in the World "American leadership is essential to …
#> 2 Bring Americans Together and Remove B… "Democrats believe that everyone dese…
#> 3 Combat Climate Change, Build a Clean … "Climate change is an urgent threat a…
#> 4 Confront Global Threats "Democrats will protect our country. …
#> 5 Create Good-Paying Jobs "Democrats know that nothing is more …
#> 6 Ensure the Health and Safety of All A… "Democrats have been fighting to secu…
#> 7 Fight for Economic Fairness and Again… "Democrats believe that today's extre…
#> 8 Preamble "In 2016, Democrats meet in Philadelp…
#> 9 Principled Leadership "Democrats believe that America must …
#> 10 Protect Our Values "Our values of inclusion and toleranc…
#> 11 Protect Voting Rights, Fix Our Campai… "Democrats know that Americans' right…
#> 12 Provide Quality and Affordable Educat… "Democrats know that every child, no …
#> 13 Raise Incomes and Restore Economic Se… "Democrats believe we must break down…
#> 14 Support Our Troops and Keep Faith wit… "Democrats believe America must conti…
Created on 2020-07-21 by the reprex package (v0.3.0)
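Note that group_by() followed by summarise() returns the sections sorted alphabetically by header, as the output above shows, so the document order is lost. If the original order matters, one option is to group on a running section number instead. This is only a sketch on a toy page that mirrors the structure in the question; the section column and the inline HTML are assumptions made for illustration.

```r
library(rvest)
library(dplyr)

# Toy page mirroring the structure in the question (not the real site)
html <- '<div class="field-docs-content">
  <p><strong>Title 1</strong></p>
  <p>first body</p>
  <p>second body</p>
  <p><strong>Another Title 2</strong></p>
  <p>third body</p>
</div>'
page <- read_html(html)

headers <- page %>%
  html_nodes(".field-docs-content strong") %>%
  html_text()

ordered <- page %>%
  html_nodes(".field-docs-content p") %>%
  html_text() %>%
  tibble(body_text = .) %>%
  # Running count of headers seen so far: 1 for the first section, 2 for the next
  mutate(section = cumsum(body_text %in% headers)) %>%
  # Drop any paragraphs that appear before the first header
  filter(section > 0) %>%
  group_by(section) %>%
  summarise(
    header = first(body_text),                    # each section starts with its header
    body = paste(body_text[-1], collapse = " ")   # the remaining paragraphs
  ) %>%
  select(-section)
ordered
```

Because the grouping key is an integer that increases down the page, the summarised rows come back in document order rather than alphabetical order.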
Here is a solution that uses XPath syntax to pick out the right elements and mapply to put them into a tibble:
library(rvest)
url <- "https://www.presidency.ucsb.edu/documents/2016-democratic-party-platform"
page <- read_html(url)
# Paragraphs that contain a <strong> child are the section headers
heads <- page %>%
  html_nodes(xpath = "//p/strong/parent::p") %>%
  html_text()
# Every paragraph on the page, in document order
all_p <- page %>%
  html_nodes(xpath = "//p") %>%
  html_text()
start <- match(heads, all_p)
# Each section ends just before the next header; the last runs to the final paragraph
end <- c(start[-1] - 1, length(all_p))
result <- tibble::as_tibble(do.call(rbind, mapply(function(a, b, h) {
  data.frame(header = h, body = paste(all_p[(a + 1):b], collapse = "\n"))
}, a = start, b = end, h = heads, SIMPLIFY = FALSE)))
This gives you:
result
#> # A tibble: 15 x 2
#> header body
#> <chr> <chr>
#> 1 Preamble "In 2016, Democrats meet in Philadelphi~
#> 2 Raise Incomes and Restore Economic Securi~ "Democrats believe we must break down a~
#> 3 Create Good-Paying Jobs "Democrats know that nothing is more im~
#> 4 Fight for Economic Fairness and Against I~ "Democrats believe that today's extreme~
#> 5 Bring Americans Together and Remove Barri~ "Democrats believe that everyone deserv~
#> 6 Protect Voting Rights, Fix Our Campaign F~ "Democrats know that Americans' right t~
#> 7 Combat Climate Change, Build a Clean Ener~ "Climate change is an urgent threat and~
#> 8 Provide Quality and Affordable Education "Democrats know that every child, no ma~
#> 9 Ensure the Health and Safety of All Ameri~ "Democrats have been fighting to secure~
#> 10 Principled Leadership "Democrats believe that America must le~
#> 11 Support Our Troops and Keep Faith with Ou~ "Democrats believe America must continu~
#> 12 Confront Global Threats "Democrats will protect our country. We~
#> 13 Protect Our Values "Our values of inclusion and tolerance ~
#> 14 A Leader in the World "American leadership is essential to ke~
#> 15 The American Presidency ProjectJohn Wooll~ "Twitter Facebook\nCopyright © The Amer~
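Row 15 in the output above appears to come from the page footer, since "//p" matches every paragraph on the page rather than only those in the document body. One way to avoid that, sketched here on a toy page rather than the real site, is to anchor the XPath at the content div. The inline HTML, the footer div, and the content variable are assumptions made for illustration.

```r
library(rvest)

# Toy page: a content div followed by a footer that also contains <p><strong>
html <- '<html><body>
<div class="field-docs-content">
  <p><strong>Title 1</strong></p>
  <p>first body</p>
  <p>second body</p>
  <p><strong>Another Title 2</strong></p>
  <p>third body</p>
</div>
<div class="footer"><p><strong>Site name</strong></p><p>Copyright notice</p></div>
</body></html>'
page <- read_html(html)

# Restrict every query to paragraphs inside the content div
content <- "//div[contains(@class, 'field-docs-content')]"
heads <- page %>%
  html_nodes(xpath = paste0(content, "//p[strong]")) %>%
  html_text()
all_p <- page %>%
  html_nodes(xpath = paste0(content, "//p")) %>%
  html_text()

start <- match(heads, all_p)
end <- c(start[-1] - 1, length(all_p))  # stop just before the next header

result <- data.frame(
  header = heads,
  body = mapply(function(a, b) paste(all_p[(a + 1):b], collapse = "\n"), start, end),
  stringsAsFactors = FALSE
)
result
```

With the footer excluded, only the two sections inside the content div remain. The sketch still assumes that every header is followed by at least one body paragraph.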