R: Scrape HTML table from Yahoo Finance
I want to scrape a table from Yahoo Finance and load it as a data frame.
Unfortunately, I don't really know how to use the rvest package.
This is my first attempt:
library(tidyverse)
library(rvest)
url<-"https://finance.yahoo.com/calendar/ipo?from=2021-02-21&to=2021-02-27&day=2021-02-23"
url %>%
html() %>%
html_nodes(xpath="table") %>%
html_table()
As expected, the code does not work.
Can anyone help me?
I would like to get the table on that page as a data frame.
Thank you very much!
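Unfortunately, this table is not easy to extract with html_table. Here is an approach that pulls the individual cell values out of the table and does some post-processing to get the data into a data frame.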
library(rvest)
url <- "https://finance.yahoo.com/calendar/ipo?from=2021-02-21&to=2021-02-27&day=2021-02-23"
# Read the page and keep the first <table> node
tab1 <- url %>%
  read_html() %>%
  html_nodes('table') %>%
  .[[1]]
# Column names come from the header cells
header <- tab1 %>% html_nodes('th') %>% html_text()
# Collect every body cell as text, then fold the flat vector into 9 columns, row by row
result <- tab1 %>%
  html_nodes('tr.simpTblRow td') %>%
  html_text() %>%
  matrix(ncol = 9, byrow = TRUE) %>%
  as.data.frame()
names(result) <- header
result
#    Symbol                                     Company Exchange
#1    VELOU            Velocity Acquisition Corp. Units   Nasdaq
#2    FTAAU          FTAC Athena Acquisition Corp. Unit   Nasdaq
#3    CMIIU               CM Life Sciences II Inc. Unit   Nasdaq
#4                                       Metropress Ltd      LSE
#5 CTWO.P.V                        County Capital 2 Ltd     TSXV
#6    GSEVU              Gores Holdings VII, Inc. Units   Nasdaq
#7     NVOS Novo Integrated Sciences, Inc. Common Stock   Nasdaq
#8    SLAMU                             Slam Corp. Unit   Nasdaq
#          Date   Price Range Price Currency   Shares  Actions
#1 Feb 23, 2021 10.00 - 10.00     -      USD        - Expected
#2 Feb 23, 2021             -     -      USD        - Expected
#3 Feb 23, 2021 10.00 - 10.00     -      USD        - Expected
#4 Feb 01, 2021             -     6      GBP 45452752   Priced
#5 Nov 19, 2020   0.08 - 0.08   0.1      CAD  6000000   Priced
#6 Feb 23, 2021             -     -      USD        - Expected
#7 Feb 23, 2021             -     -      USD        - Expected
#8 Feb 23, 2021 10.00 - 10.00     -      USD        - Expected
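For anyone who wants to try the cell-by-cell technique without hitting the live page, here is a minimal sketch on a small inline HTML snippet. The snippet, its three columns, and the final type conversion are assumptions made up for illustration; only the `tr.simpTblRow` class is taken from the code above, and Yahoo's real markup may differ or change.

```r
library(rvest)

# Hypothetical inline HTML standing in for the Yahoo Finance page
page <- read_html('
<table>
  <thead><tr><th>Symbol</th><th>Price</th><th>Shares</th></tr></thead>
  <tbody>
    <tr class="simpTblRow"><td>AAA</td><td>-</td><td>-</td></tr>
    <tr class="simpTblRow"><td>BBB</td><td>0.1</td><td>6000000</td></tr>
  </tbody>
</table>')

tab <- html_node(page, "table")
header <- tab %>% html_nodes("th") %>% html_text()

# Cells arrive as one flat character vector; byrow = TRUE folds them back into rows
cells <- tab %>% html_nodes("tr.simpTblRow td") %>% html_text()
result <- as.data.frame(
  matrix(cells, ncol = length(header), byrow = TRUE),
  stringsAsFactors = FALSE
)
names(result) <- header

# Treat "-" as missing, then convert the numeric columns
result[result == "-"] <- NA
result$Price <- as.numeric(result$Price)
result$Shares <- as.numeric(result$Shares)
str(result)
```

Using `length(header)` instead of a hard-coded column count keeps the reshape working if the table gains or loses a column, as long as every row has one cell per header.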
This is the simplest way to solve the problem, and it also keeps the headers :)
library(tidyverse)
library(rvest)
url<-"https://finance.yahoo.com/calendar/ipo?from=2021-02-21&to=2021-02-27&day=2021-02-23"
# Scrape the data
df <- url %>%
read_html() %>%
html_nodes(xpath = '//*[@id="cal-res-table"]') %>%
as.character() %>%
XML::readHTMLTable()
# df is a list of two tables (as you can see from the website) - pick only the first list item
tbl <- as.data.frame(df[1])
# print your table
tbl
#>   NULL..Symbol.                                NULL.Company NULL.Exchange
#> 1         VELOU            Velocity Acquisition Corp. Units        Nasdaq
#> 2         FTAAU          FTAC Athena Acquisition Corp. Unit        Nasdaq
#> 3         CMIIU               CM Life Sciences II Inc. Unit        Nasdaq
#> 4                                            Metropress Ltd           LSE
#> 5      CTWO.P.V                        County Capital 2 Ltd          TSXV
#> 6         GSEVU              Gores Holdings VII, Inc. Units        Nasdaq
#> 7          NVOS Novo Integrated Sciences, Inc. Common Stock        Nasdaq
#> 8         SLAMU                             Slam Corp. Unit        Nasdaq
#>      NULL.Date NULL.Price.Range NULL.Price NULL.Currency NULL.Shares
#> 1 Feb 23, 2021    10.00 - 10.00          -           USD           -
#> 2 Feb 23, 2021                -          -           USD           -
#> 3 Feb 23, 2021    10.00 - 10.00          -           USD           -
#> 4 Feb 01, 2021                -          6           GBP    45452752
#> 5 Nov 19, 2020      0.08 - 0.08        0.1           CAD     6000000
#> 6 Feb 23, 2021                -          -           USD           -
#> 7 Feb 23, 2021                -          -           USD           -
#> 8 Feb 23, 2021    10.00 - 10.00          -           USD           -
#>   NULL.Actions
#> 1     Expected
#> 2     Expected
#> 3     Expected
#> 4       Priced
#> 5       Priced
#> 6     Expected
#> 7     Expected
#> 8     Expected
You may want to clean up those column names, though. :)
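One possible way to do that cleanup is sketched below. The small data frame is a hypothetical stand-in that only reuses three of the column names printed above, and the regular expressions assume the names always follow that "NULL." pattern.

```r
# Hypothetical stand-in with the same style of column names as `tbl` above
tbl <- data.frame(a = "VELOU", b = "10.00 - 10.00", c = "Expected")
names(tbl) <- c("NULL..Symbol.", "NULL.Price.Range", "NULL.Actions")

# Drop the leading "NULL." prefix, then any stray dots left at either end
clean <- sub("^NULL\\.", "", names(tbl))
clean <- gsub("^\\.+|\\.+$", "", clean)
names(tbl) <- clean

names(tbl)
#> [1] "Symbol"      "Price.Range" "Actions"
```

The prefix most likely comes from converting a one-element list with `as.data.frame(df[1])`, so extracting the element with `df[[1]]` might avoid it in the first place; that is an assumption and has not been checked against the live page.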