rvest: scrape links from a webpage
I am using rvest to scrape some links from the magazine 'The Hustle'. I used this code:
library(rvest)
page <- read_html("https://thehustle.co/daily/page/33/") %>%
  html_nodes(".daily-article-title") %>%
  html_attr('href')
However, this returns a vector of 30 NAs. I found the class using SelectorGadget, so I am not sure what is going wrong here.
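The links sit on an element above the one with class 'daily-article-title'. The selected nodes carry no href attribute of their own, which is why html_attr('href') returns NA for each of them. Here is a way to get the titles and the corresponding links.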
library(rvest)
webpage <- read_html("https://thehustle.co/daily/page/33/")
# Titles: the text of each <h3> with class "daily-article-title"
webpage %>%
  html_nodes("h3.daily-article-title") %>%
  html_text() -> title
title
# [1] "\nApple buys itself a 0m Christmas present\n"
# [2] "\nSan Francisco wages war on robots\n"
# [3] "\nThe US could lose its greatest export\n"
# [4] "\n\"Mom, where do podcasts come from?\"\n"
# [5] "\nTencent Music to team up with Spotify?\n"
# [6] "\nFirst rule of the Farmers Business Network?\n"
# [7] "\nSpiegel goes HAM on social media\n"
# [8] "\nBanks won't take weed companies’ cash\n"
# [9] "\nThe Koch bros just took a 0m stake in Time\n"
#[10] "\n4 mins to smarter Monday smalltalk\n"
#...
#...
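# Links: the href of each <a> inside the element whose class attribute
# is exactly 'col-md-12 daily-wrap clearfix'
webpage %>%
  html_nodes("[class='col-md-12 daily-wrap clearfix'] a") %>%
  html_attr('href') -> link
link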
# [1] "https://thehustle.co/apple-christmas-present"
# [2] "https://thehustle.co/war-on-robots"
# [3] "https://thehustle.co/big-data-trade-nafta-daily"
# [4] "https://thehustle.co/apple-podcast-market"
# [5] "https://thehustle.co/tencent-spotify-truce"
# [6] "https://thehustle.co/first-rule-of-farmers"
# [7] "https://thehustle.co/snap-anti-facebook"
# [8] "https://thehustle.co/weed-banking"
# [9] "https://thehustle.co/pepshi-bros"
#[10] "https://thehustle.co/rundown"
#...
#...
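To see why the original selector gives NA, here is a minimal sketch on a small inline HTML fragment instead of the live page. The fragment, the example.com URLs, and the names on_heading, links, titles and articles are made up for illustration. It assumes each heading is wrapped in an <a> that carries the href, which may not match the real markup exactly.

```r
library(rvest)

# Hypothetical fragment (an assumption, not the real page source):
# each <h3 class="daily-article-title"> sits inside an <a> holding the href.
html <- '
<div class="col-md-12 daily-wrap clearfix">
  <a href="https://example.com/first"><h3 class="daily-article-title">
First title
</h3></a>
  <a href="https://example.com/second"><h3 class="daily-article-title">
Second title
</h3></a>
</div>'

doc <- read_html(html)

# Asking the <h3> itself for href gives NA: the attribute is not on that node.
on_heading <- doc %>%
  html_nodes("h3.daily-article-title") %>%
  html_attr("href")

# Selecting the enclosing <a> through a descendant selector gives the URLs.
links <- doc %>%
  html_nodes("div.daily-wrap a") %>%
  html_attr("href")

# Titles, with the surrounding newlines trimmed.
titles <- doc %>%
  html_nodes("h3.daily-article-title") %>%
  html_text() %>%
  trimws()

# Pair titles and links by position; this only works if the lengths match.
articles <- data.frame(title = titles, link = links, stringsAsFactors = FALSE)
print(articles)
```

On the real page it is worth checking that length(title) equals length(link) before combining them, since the wrapper may contain <a> elements that are not article links.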