Webscraping in R, accessing html nodes

A straightforward application of the rvest package: I am trying to scrape a class of HTML links from a site.

This code reads the page I need from the site:

library(rvest)
library(magrittr)

foo <- "http://www.realclearpolitics.com/epolls/2010/house/2010_elections_house_map.html" %>% 
            read_html

I have also located the right nodes with a CSS selector:

foo %>% 
  html_nodes("#states td") %>% 
  extract(2:4)

which returns

{xml_nodeset (3)}
[1] <td>\n  <a class="dem" href="/epolls/2010/house/ar/arkansas_4th_district_rankin_vs_ross-1343.html">\n    <span>AR4</span>\n  </a>\n</td>
[2] <td>\n  <a class="dem" href="/epolls/2010/house/ct/connecticut_1st_district_brickley_vs_larson-1713.html">\n    <span>CT1</span>\n  </a>\n</td>
[3] <td>\n  <a class="dem" href="/epolls/2010/house/ct/connecticut_2nd_district_peckinpaugh_vs_courtney-1715.html">\n    <span>CT2</span>\n  </a>\n</td>

OK, the href attribute is exactly what I am looking for. But this

foo %>% 
  html_nodes("#states td") %>% 
  extract(2:4) %>% 
  html_attr("href")

returns

[1] NA NA NA

How do I access the underlying links?

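The href attribute belongs to the <a> element nested inside each <td>, not to the <td> itself, which is why html_attr("href") returns NA on the <td> nodes. Using xml_children(), you can step down to the child nodes: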

foo %>% 
  html_nodes('#states td') %>% 
  xml_children %>%
  html_attr('href') %>%
  extract(2:4)

Returns:

[1] "/epolls/2010/house/ar/arkansas_4th_district_rankin_vs_ross-1343.html"            
[2] "/epolls/2010/house/ct/connecticut_1st_district_brickley_vs_larson-1713.html"     
[3] "/epolls/2010/house/ct/connecticut_2nd_district_peckinpaugh_vs_courtney-1715.html"

You can also put extract before html_attr; other orderings of these steps will probably work as well.
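
As an alternative to xml_children(), the selector itself can be extended so that it matches the <a> elements directly. The following is a minimal sketch, not taken from the original answer: it assumes a small inline HTML fragment (with made-up hrefs) standing in for the real page so that it runs offline, and the names page, links and dem_links are illustrative only.

```r
library(rvest)  # also provides read_html() and the %>% pipe

# Hypothetical stand-in for the real page; the hrefs are invented for illustration.
page <- read_html('
<table id="states">
  <tr>
    <td><a class="dem" href="/epolls/a.html"><span>AR4</span></a></td>
    <td><a class="gop" href="/epolls/b.html"><span>CT1</span></a></td>
  </tr>
</table>')

# Select the <a> elements inside the cells, then read their href attribute.
links <- page %>%
  html_nodes("#states td a") %>%
  html_attr("href")

# Restrict the match to links of a single class, here "dem".
dem_links <- page %>%
  html_nodes("#states td a.dem") %>%
  html_attr("href")

print(links)
print(dem_links)
```

Because the selector already lands on the nodes that carry the attribute, no extra step down the tree is needed, and appending a class such as a.dem limits the result to one class of links.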