Confusion Regarding HTML Code For Web Scraping With R

Question

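I'm having trouble using the rvest package in R, most likely because of my lack of knowledge of CSS or HTML. Here is an example (my guess is that ".quote-header-info" is what's wrong; I also tried ".Trsdu ..." but had no luck either):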

library(rvest)
url="https://finance.yahoo.com/quote/SPY"

website=read_html(url) %>%
  html_nodes(".quote-header-info") %>%
  html_text() %>% toString()

website

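Below is the web page I'm trying to scrape. Specifically, I'm looking to grab the value "416.74". I took a look at the documentation here (https://cran.r-project.org/web/packages/rvest/rvest.pdf), but I think the issue is that I don't understand the breakdown of the web page I'm looking at.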

Answer 1

The tricky part is determining the correct set of attributes to select only this one html node.

In this case, it is the span tag with the classes Trsdu(0.3s) and Fz(36px).

library(rvest)
url="https://finance.yahoo.com/quote/SPY"

#read page once
page <- read_html(url)

#now extract information from the page
price <- page %>%
  html_nodes("span.Trsdu\\(0\\.3s\\).Fz\\(36px\\)") %>%
  html_text()

price

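Note: "(", ")", and "." are all special characters in a CSS selector, so each one needs to be double escaped with "\\" inside the R string.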
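
To see the escaping in isolation, here is a minimal sketch that runs against an inline HTML fragment instead of the live page. The fragment is hypothetical: it only mimics the class names mentioned above, and the second span is there just to show that the selector picks out a single node.

```r
library(rvest)

# Hypothetical fragment that imitates the class names described above;
# it is not a copy of the real Yahoo Finance markup.
html <- paste0(
  '<div id="quote-header-info">',
  '<span class="Trsdu(0.3s) Fz(36px)">416.74</span>',
  '<span class="Trsdu(0.3s) Fz(12px)">+1.23</span>',
  '</div>'
)
page <- read_html(html)

# "\\(" in the R string becomes the CSS escape "\(" seen by the selector engine,
# so the parentheses and dot are treated as part of the class name.
price <- page %>%
  html_nodes("span.Trsdu\\(0\\.3s\\).Fz\\(36px\\)") %>%
  html_text()

price
```

Only the span that carries both classes is matched; the one with Fz(12px) is skipped.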

Answer 2

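Those classes are dynamic and change much more frequently than other parts of the html. They should be avoided. You have at least two more robust options.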

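1. Extract the javascript object that contains that data (and a lot more) from a script tag, then parse it with jsonlite
2. Use positional matching against other, more stable html elements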

I show both below. The advantage of the first is that you can extract lots of other page data from the resulting json object.

library(magrittr)
library(rvest)
library(stringr)
library(jsonlite)

page <- read_html('https://finance.yahoo.com/quote/SPY')

data <- page %>% 
  toString() %>% 
  stringr::str_match('root\\.App\\.main = (.*?[\\s\\S]+)(?=;[\\s\\S]+\\(th)') %>% .[2]

json <- jsonlite::parse_json(data)
print(json$context$dispatcher$stores$StreamDataStore$quoteData$SPY$regularMarketPrice$raw)
print(page %>% html_node('#quote-header-info div:nth-of-type(2) ~ div div:nth-child(1) span') %>% html_text() %>% as.numeric())
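
Since the live page can change at any time, the script-tag approach may be easier to follow on a small offline example. The sketch below assumes a hypothetical page whose script tag assigns a JSON object to root.App.main; the JSON structure is heavily simplified and is not the real Yahoo Finance payload, and the regular expression is a simpler one that only works because the JSON sits on a single line.

```r
library(rvest)
library(stringr)
library(jsonlite)

# Hypothetical page: the JSON layout below is invented for illustration only.
html <- paste0(
  '<html><body>',
  '<script>root.App.main = ',
  '{"context":{"quote":{"SPY":{"regularMarketPrice":{"raw":416.74}}}}};',
  '</script>',
  '</body></html>'
)
page <- read_html(html)

# Pull the text of the script tag, then capture everything between
# "root.App.main = " and the closing "};".
script_text <- page %>% html_nodes("script") %>% html_text()
raw_json <- str_match(script_text, "root\\.App\\.main = (\\{.*\\});")[, 2]

# Parse the captured string into a nested R list and walk down to the price.
json <- parse_json(raw_json)
price <- json$context$quote$SPY$regularMarketPrice$raw

price
```

On a real page the path into the list will differ, so it is worth inspecting the parsed object with str(json, max.level = 2) before hard-coding it.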


r

web-scraping

rvest