R: Scrape HTML table from Yahoo Finance

Question

I want to scrape a table from Yahoo Finance and load it into a data frame. Unfortunately, I don't really know how to use the rvest package.

Here is my first attempt:

library(tidyverse)
library(rvest)

url<-"https://finance.yahoo.com/calendar/ipo?from=2021-02-21&to=2021-02-27&day=2021-02-23"

url %>%
  html() %>%
  html_nodes(xpath="table") %>%
  html_table()

As expected, the code does not work. Can anyone help me?

I would like to get the table shown on that page as a data frame.

Many thanks!

Answer 1

Unfortunately, this table is not easy to extract with html_table. Here is a way to pull the individual cell values out of the table and post-process them into a data frame.

library(rvest)

url<-"https://finance.yahoo.com/calendar/ipo?from=2021-02-21&to=2021-02-27&day=2021-02-23"

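# Read the page and keep the first <table> node
url %>%
  read_html() %>%
  html_nodes('table') %>%
  .[[1]] -> tab1

# Column names come from the header cells
header <- tab1 %>% html_nodes('th') %>% html_text()

# Collect the text of every data cell, then fold the flat vector
# into 9 columns, filling row by row
result <- tab1 %>%
  html_nodes('tr.simpTblRow td') %>%
  html_text() %>%
  matrix(ncol = 9, byrow = TRUE) %>%
  as.data.frame()
names(result) <- header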

result

#    Symbol                                     Company Exchange
#1    VELOU            Velocity Acquisition Corp. Units   Nasdaq
#2    FTAAU          FTAC Athena Acquisition Corp. Unit   Nasdaq
#3    CMIIU               CM Life Sciences II Inc. Unit   Nasdaq
#4                                       Metropress Ltd      LSE
#5 CTWO.P.V                        County Capital 2 Ltd     TSXV
#6    GSEVU              Gores Holdings VII, Inc. Units   Nasdaq
#7     NVOS Novo Integrated Sciences, Inc. Common Stock   Nasdaq
#8    SLAMU                             Slam Corp. Unit   Nasdaq

#          Date   Price Range Price Currency   Shares  Actions
#1 Feb 23, 2021 10.00 - 10.00     -      USD        - Expected
#2 Feb 23, 2021             -     -      USD        - Expected
#3 Feb 23, 2021 10.00 - 10.00     -      USD        - Expected
#4 Feb 01, 2021             -     6      GBP 45452752   Priced
#5 Nov 19, 2020   0.08 - 0.08   0.1      CAD  6000000   Priced
#6 Feb 23, 2021             -     -      USD        - Expected
#7 Feb 23, 2021             -     -      USD        - Expected
#8 Feb 23, 2021 10.00 - 10.00     -      USD        - Expected
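
The reshaping step above hard-codes ncol = 9, which matches the nine headers on this page. As a small variation (not part of the original answer), the column count can be taken from the header vector instead. Below is a minimal offline sketch of that idea in base R; the cells vector is a hand-typed stand-in for the html_text() result, using the first two rows and three columns of the output above.

```r
# Stand-ins for the html_text() results (typed by hand, not scraped)
header <- c("Symbol", "Company", "Exchange")
cells  <- c("VELOU", "Velocity Acquisition Corp. Units",   "Nasdaq",
            "FTAAU", "FTAC Athena Acquisition Corp. Unit", "Nasdaq")

# Fail early if the cells do not fill whole rows
stopifnot(length(cells) %% length(header) == 0)

# Fold the flat vector into rows, one column per header
result <- as.data.frame(
  matrix(cells, ncol = length(header), byrow = TRUE),
  stringsAsFactors = FALSE
)
names(result) <- header
result
```

Deriving ncol from length(header) means the code does not need to be edited if the page gains or loses a column, and the stopifnot() check fails early if the header and cell counts stop lining up.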

Answer 2

This is the simplest way to solve the problem, and it also keeps the headers :)

library(tidyverse)
library(rvest)

url<-"https://finance.yahoo.com/calendar/ipo?from=2021-02-21&to=2021-02-27&day=2021-02-23"

# Scrape the data

df <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="cal-res-table"]') %>% 
  as.character() %>% 
  XML::readHTMLTable()

# df is a list of two tables (as you can see from the website) - pick only the first list item

tbl <- as.data.frame(df[1])

# print your table
tbl
#>   NULL..Symbol.                                NULL.Company NULL.Exchange
#> 1         VELOU            Velocity Acquisition Corp. Units        Nasdaq
#> 2         FTAAU          FTAC Athena Acquisition Corp. Unit        Nasdaq
#> 3         CMIIU               CM Life Sciences II Inc. Unit        Nasdaq
#> 4                                            Metropress Ltd           LSE
#> 5      CTWO.P.V                        County Capital 2 Ltd          TSXV
#> 6         GSEVU              Gores Holdings VII, Inc. Units        Nasdaq
#> 7          NVOS Novo Integrated Sciences, Inc. Common Stock        Nasdaq
#> 8         SLAMU                             Slam Corp. Unit        Nasdaq
#>      NULL.Date NULL.Price.Range NULL.Price NULL.Currency NULL.Shares
#> 1 Feb 23, 2021    10.00 - 10.00          -           USD           -
#> 2 Feb 23, 2021                -          -           USD           -
#> 3 Feb 23, 2021    10.00 - 10.00          -           USD           -
#> 4 Feb 01, 2021                -          6           GBP    45452752
#> 5 Nov 19, 2020      0.08 - 0.08        0.1           CAD     6000000
#> 6 Feb 23, 2021                -          -           USD           -
#> 7 Feb 23, 2021                -          -           USD           -
#> 8 Feb 23, 2021    10.00 - 10.00          -           USD           -
#>   NULL.Actions
#> 1     Expected
#> 2     Expected
#> 3     Expected
#> 4       Priced
#> 5       Priced
#> 6     Expected
#> 7     Expected
#> 8     Expected

You may want to clean up those column names, though. :)
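
One possible way to do that cleanup, sketched in base R. The nms vector is copied from the printed output above rather than re-scraped, and the choice of underscores as separators is only an assumption about what "clean" should mean here.

```r
# Column names as printed above (copied from the output, not re-scraped)
nms <- c("NULL..Symbol.", "NULL.Company", "NULL.Exchange", "NULL.Date",
         "NULL.Price.Range", "NULL.Price", "NULL.Currency", "NULL.Shares",
         "NULL.Actions")

clean <- sub("^NULL\\.+", "", nms)   # drop the leading "NULL." prefix
clean <- sub("\\.+$", "", clean)     # drop trailing dots ("Symbol." -> "Symbol")
clean <- gsub("\\.", "_", clean)     # "Price.Range" -> "Price_Range"
clean
```

With the scraped table at hand, names(tbl) <- clean would apply the result. The "NULL." prefix probably comes from calling as.data.frame() on a one-element list, so extracting the element with df[[1]] instead of df[1] may avoid it in the first place; that is an untested assumption here.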
