How to scrape a table with rvest and xpath?

Question

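Using the following documentation I have been trying to scrape a series of tables from marketwatch.com.

Here is the one represented by the code below; the link and xpath are already included in the code: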

url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
valuation <- url %>%
  html() %>%
  html_nodes(xpath='//*[@id="maincontent"]/div[2]/div[1]') %>%
  html_table()
valuation <- valuation[[1]]

I get the following error:

Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")

Thanks in advance.

Answer 1

That website doesn't use an html table, so html_table() can't find anything. It actually uses div classes column and data lastcolumn.

So you can do something like
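
To see why, here is a minimal sketch on a made-up HTML fragment. It is not the real marketwatch.com markup: only the class names are taken from this answer, and the label and value are invented. It shows html_table() coming back empty when there is no table element, while a class-based XPath still finds the nodes.

```r
library(rvest)

# Hypothetical fragment imitating a div-based layout; not the real page.
doc <- read_html('
<div id="maincontent">
  <div class="section">
    <div class="column">P/E Current</div>
    <div class="data lastcolumn">12.34</div>
  </div>
</div>')

# There are no <table> elements, so html_table() has nothing to parse.
tables <- doc %>% html_nodes("table") %>% html_table()
length(tables)

# A class-based XPath does find the label and value nodes.
labels <- doc %>% html_nodes(xpath = '//*[@class="column"]') %>% html_text()
values <- doc %>% html_nodes(xpath = '//*[@class="data lastcolumn"]') %>% html_text()
```

Note that @class="column" is an exact match on the whole attribute, so it does not also pick up the "data lastcolumn" nodes.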

library(rvest)

url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
page <- read_html(url)  # download and parse the page once

# nodes holding the row labels
valuation_col <- page %>%
  html_nodes(xpath='//*[@class="column"]')

# nodes holding the corresponding values
valuation_data <- page %>%
  html_nodes(xpath='//*[@class="data lastcolumn"]')

or even

url %>%
  read_html() %>%
  html_nodes(xpath='//*[@class="section"]')

to get you most of the way there.

Please also read their terms of use - in particular 3.4.
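
For the remaining step, one possible approach (a sketch, not tested against the live site) is to pull the text out of both node sets with html_text() and pair them up in a data frame. The fragment below is a hypothetical stand-in for the downloaded page with invented labels and values, and it assumes each label has exactly one matching value in the same order.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page; labels and values are invented.
page <- read_html('
<div class="section"><div class="column">P/E Current</div><div class="data lastcolumn">12.34</div></div>
<div class="section"><div class="column">Price to Sales Ratio</div><div class="data lastcolumn">1.50</div></div>')

# Extract the text of each label and value node, trimming stray whitespace.
labels <- page %>% html_nodes(xpath = '//*[@class="column"]') %>% html_text(trim = TRUE)
values <- page %>% html_nodes(xpath = '//*[@class="data lastcolumn"]') %>% html_text(trim = TRUE)

# Pair them by position; this assumes every label has exactly one value.
stopifnot(length(labels) == length(values))
valuation <- data.frame(metric = labels, value = values, stringsAsFactors = FALSE)
```

If the counts differ on the real page, selecting each "section" node first and extracting the label and value inside it would be a safer way to keep rows aligned.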
