Why doesn't purrr's map function scrape the data from all URLs?

Question

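I'm trying to scrape the lyrics of several artists from a website so that I can later build word clouds per artist. I generate the URLs and then use purrr's map function to scrape the lyrics from each one. The code runs, but after a while it returns the lyrics of only one artist. What do I need to do to scrape all the lyrics and store them in a single object?

The code is as follows: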

##=----------------------------------------------INSTALL PACKAGES---------------------------------------

#install.packages("tidyverse")

##=----------------------------------------------LIBRARIES----------------------------------------------

library(rvest)
library(stringr)
library(purrr)

##=----------------------------------------------FUNCTIONS----------------------------------------------

hash<-function(x)
{
  x<-read_html(x)%>%
    html_nodes("cnt-letra p402_premium, p")%>%
    html_text()
  x<-str_remove_all(x,"[:punct:]")
  x<-tolower(x)
  x<-iconv(x,to ="ASCII//TRANSLIT")
  x<-str_remove_all(x,"'")
}

##=----------------------------------------------MAIN CODE----------------------------------------------

url<-"https://www.letras.com/mais-acessadas/reggaeton/"

##scrape song titles from the ranking page
song<-read_html(url)%>%
  html_nodes("b")%>%
  html_text()

##scrape artist names from the ranking page
artist<-read_html(url)%>%
  html_nodes("li a span")%>%
  html_text()

#Strings Cleaning
artist_clean<-str_remove_all(artist,"[:punct:]")
artist_clean<-tolower(artist_clean)
artist_clean<-iconv(artist_clean,to ="ASCII//TRANSLIT")
artist_clean<-str_remove_all(artist_clean,"'")
artist_clean<-gsub(" ","-",artist_clean)


#Strings Cleaning
song_clean<-str_remove_all(song,"[:punct:]")
song_clean<-tolower(song_clean)
song_clean<-iconv(song_clean,to ="ASCII//TRANSLIT")
song_clean<-str_remove_all(song_clean,"'")
song_clean<-gsub(" ","-",song_clean)

home<-"https://letras.com"

##url generation
generated_urls<-paste(home, "/", artist_clean,"/", song_clean, sep = "")
generated_urls<-generated_urls[1:5]

x<-purrr::map(generated_urls,hash)

Answer 1

I'm not sure why it repeats the same one, but if you pass the URLs as names before running the map, it produces the expected output:

generated_urls[1:5] %>%
  set_names() %>% 
  map(hash)

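You can then access the lyrics by URL or by index, which is probably more useful anyway. Another way to approach this would be to put the URLs in a column of a tibble and use tibble(url = generated_urls) %>% mutate(lyrics = map(url, hash)), and so on.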
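As a rough, offline sketch of both approaches: the function fake_hash and the two URLs below are hypothetical stand-ins so the example runs without network access. In the real script you would pass hash and generated_urls instead. It also assumes the tibble and dplyr packages are loaded, which the question's script does not do.

```r
library(purrr)
library(tibble)
library(dplyr)

# Hypothetical stand-in for hash(): no scraping, just returns a string
fake_hash <- function(u) paste("lyrics for", basename(u))

# Hypothetical URLs in the same shape as generated_urls
urls <- c("https://letras.com/artist-a/song-1",
          "https://letras.com/artist-b/song-2")

# set_names() with no further arguments names each element after its own value,
# so map() returns a list keyed by URL
by_url <- urls %>%
  set_names() %>%
  map(fake_hash)

by_url[["https://letras.com/artist-b/song-2"]]  # look up by URL
by_url[[1]]                                     # look up by position

# Tibble variant: one row per URL, results in a second column
# (map_chr is used here because fake_hash returns a single string;
#  with the real hash(), which returns a vector per page, map() would keep a list column)
lyrics_tbl <- tibble(url = urls) %>%
  mutate(lyrics = map_chr(url, fake_hash))
```

Keeping the URL next to each result, either as a list name or as a tibble column, makes it easy to check which pages actually returned different lyrics.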

Tags: r, web-scraping, rvest, purrr, tidyverse