How can I retain format in a page webscraped with rvest

Question

I'd like to scrape a sample web page to get song lyrics, and I want to reproduce its layout in a Shiny app, possibly inside a renderUI() function:

People all over the world (everybody) 
Join hands (join)
Start a love train, love train
People all over the world (all the world, now)
Join hands (love ride)
Start a love train (love ride), love train

The next stop that we make will be soon (etc)

With rvest I can get the node set and the plain text, but it's unclear what the best way is to display the text in its original format.

library(rvest)
url <- "https://play.google.com/music/preview/Ttyni4p5vi3ohx52e7ye7m37hlm?lyrics=1&utm_source=google&utm_medium=search&utm_campaign=lyrics&pcampaignid=kp-lyrics&sa=X&ved=0ahUKEwiV7oXtqtvNAhVB5GMKHTnHDZEQr6QBCBsoADAB"

 read_html(url) %>%
   html_nodes("p")

{xml_nodeset (6)}
[1] <p>People all over the world (everybody)<br/>Join hands (join)<br/>Start a love train, love train<br/>People all over the world (a ...
[2] <p>The next stop that we make will be soon<br/>Tell all the folks in Russia, and China, too<br/>Don't you know that it's time to g ...

read_html(url) %>%
   html_nodes("p") %>% 
   html_text()

[1] "People all over the world (everybody)Join hands (join)Start a love train, love trainPeople all over the world (all the world, now)Join hands (love ride)Start a love train (love ride), love train"                                                                                                                                                                                                            
[2] "The next stop that we make will be soonTell all the folks in Russia, and China, tooDon't you know that it's time to get on boardAnd let this train keep on riding, riding on throughWell, well"

TIA

Answer 1

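You can borrow xml2::xml_contents, which returns all child nodes, both text and tags, as separate items. Since rvest uses xml2 for things like read_html, the function should already be available without explicitly calling library(xml2) (but go ahead if you like).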
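
To see what that looks like, here is a minimal sketch on an inline HTML string (a made-up stand-in for one verse, so no network access is needed). The xml2:: prefix is only a precaution in case xml2 is not attached in your session.

```r
library(rvest)

# Hypothetical inline snippet standing in for one verse of the real page
verse <- read_html("<p>Join hands (join)<br/>Start a love train, love train</p>") %>%
    html_node("p")

# Children of <p>: a text node, a <br> element, then another text node
parts <- xml2::xml_contents(verse)

# Text of each child; the <br> element contributes an empty string
texts <- html_text(parts)

# Keep only the non-empty pieces, i.e. the lyric lines
lines <- texts[texts != ""]
print(lines)
```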

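If you add purrr::map, you can keep each <p> tag's children nested, so the verses stay separate. If you'd rather not add another package, map is mostly the same as lapply here, apart from the last call, so I've added base R versions in the comments.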

library(rvest)
library(xml2)  # for `xml_contents`, in case your rvest version does not attach xml2
library(purrr) # for `map`

url <- "https://play.google.com/music/preview/Ttyni4p5vi3ohx52e7ye7m37hlm?lyrics=1&utm_source=google&utm_medium=search&utm_campaign=lyrics&pcampaignid=kp-lyrics&sa=X&ved=0ahUKEwiV7oXtqtvNAhVB5GMKHTnHDZEQr6QBCBsoADAB"

url %>% read_html() %>% 
    html_nodes("p") %>% 
    # For each node, return all content nodes, both text and tags, separated. From xml2.
    map(xml_contents) %>%    # or lapply(xml_contents)
    # For each nested node, get the text. Here, this just reduces "<br />" tags to "".
    map(html_text) %>%       # or lapply(html_text)
    # For each list element, subset to non-empty strings.
    map(~.x[.x != ''])       # or lapply(function(x){x[x != '']})

## [[1]]
## [1] "People all over the world (everybody)"         
## [2] "Join hands (join)"                             
## [3] "Start a love train, love train"                
## [4] "People all over the world (all the world, now)"
## [5] "Join hands (love ride)"                        
## [6] "Start a love train (love ride), love train"    
## 
## [[2]]
## [1] "The next stop that we make will be soon"             
## [2] "Tell all the folks in Russia, and China, too"        
## [3] "Don't you know that it's time to get on board"       
## [4] "And let this train keep on riding, riding on through"
## [5] "Well, well" 
## 
## ...
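
For the Shiny part of the question, one possible way to display the result is to rebuild each verse as a <p> whose lines are joined with <br/>. This is a rough sketch, not tested against the original page: `verses` is a hand-written stand-in for the list above, and `verses_to_tags` is a hypothetical helper name.

```r
library(shiny)

# Assumed stand-in for the scraped result: one character vector per verse
verses <- list(
    c("People all over the world (everybody)", "Join hands (join)"),
    c("The next stop that we make will be soon", "Well, well")
)

# Hypothetical helper: one <p> per verse, lines separated by <br/> tags
verses_to_tags <- function(verses) {
    tagList(lapply(verses, function(lines) {
        # Escape the scraped text first, then join the lines with literal <br/> tags
        tags$p(HTML(paste(htmltools::htmlEscape(lines), collapse = "<br/>")))
    }))
}

ui <- fluidPage(uiOutput("lyrics"))

server <- function(input, output, session) {
    output$lyrics <- renderUI(verses_to_tags(verses))
}

# shinyApp(ui, server)  # uncomment to launch the app
```

Escaping before wrapping the string in HTML() keeps any stray markup in the scraped text from being interpreted by the browser; only the <br/> separators added here are treated as tags.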
