rvest html_table reads an incomplete table
I'm trying to get a table from the following address:
library(rvest)
library(dplyr)
base <- "https://www.investing.com/equities/penoles-historical-data"
data_df <- (read_html(base) %>% html_table)[[2]]
But it only reads the first 20 rows.
Is there a way to read all of the information?
Thanks in advance.
When you set a date range in the calendar above the table, the page makes an API call with the following request:

POST https://www.investing.com/instruments/HistoricalDataAjax

with 'X-Requested-With: XMLHttpRequest' as a header and a form-url-encoded body containing, among other parameters, the start and end dates. The payload also includes a field curr_id, which can be scraped from the main page by looking for the div that carries a pair_ids attribute.
The following code fetches the data for the given date range into a data frame:
library(rvest)
library(httr)

startDate <- as.Date("2020-06-01")
endDate <- Sys.Date()  # today
userAgent <- "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
mainUrl <- "https://www.investing.com/equities/penoles-historical-data"

s <- html_session(mainUrl)

# curr_id lives in the pair_ids attribute of a div on the main page
pair_ids <- s %>%
  html_nodes("div[pair_ids]") %>%
  html_attr("pair_ids")

# replay the form the calendar widget submits; request_POST is an
# unexported rvest helper, hence the ':::'
resp <- s %>% rvest:::request_POST(
  "https://www.investing.com/instruments/HistoricalDataAjax",
  add_headers('X-Requested-With' = 'XMLHttpRequest'),
  user_agent(userAgent),
  body = list(
    curr_id = pair_ids,
    header = "PEOLES Historical Data",
    st_date = format(startDate, format = "%m/%d/%Y"),
    end_date = format(endDate, format = "%m/%d/%Y"),
    interval_sec = "Daily",
    sort_col = "date",
    sort_ord = "DESC",
    action = "historical_data"
  ),
  encode = "form") %>%
  html_table

print(resp[[1]])
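Note that every column of the returned table comes back as character text: the dates look like "Jun 01, 2020" and the prices carry thousands separators. A minimal cleanup sketch, assuming the column names the page currently serves ("Date", "Price", "Open", "High", "Low") and an English locale for the month abbreviations:

library(dplyr)

df <- resp[[1]]

clean_df <- df %>%
  mutate(
    Date  = as.Date(Date, format = "%b %d, %Y"),  # "Jun 01, 2020" -> Date; needs an English locale for %b
    Price = as.numeric(gsub(",", "", Price)),     # drop thousands separators
    Open  = as.numeric(gsub(",", "", Open)),
    High  = as.numeric(gsub(",", "", High)),
    Low   = as.numeric(gsub(",", "", Low))
  )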
The same code in Python:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import datetime
startDate = datetime.date(2020, 6, 1)
endDate = datetime.date.today()
userAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
s = requests.Session()
# fetch the main page first so the session picks up its cookies
r = s.get("https://www.investing.com/equities/penoles-historical-data",
          headers={"User-Agent": userAgent})
soup = BeautifulSoup(r.text, "html.parser")

# curr_id lives in the pair_ids attribute of a div on the main page
pair_id = soup.find("div", attrs={"pair_ids": True})["pair_ids"]

# replay the form the calendar widget submits
r = s.post("https://www.investing.com/instruments/HistoricalDataAjax",
           headers={
               "X-Requested-With": "XMLHttpRequest",
               "User-Agent": userAgent,
           },
           data={
               "curr_id": pair_id,
               "header": "PEOLES Historical Data",
               "st_date": startDate.strftime("%m/%d/%Y"),
               "end_date": endDate.strftime("%m/%d/%Y"),
               "interval_sec": "Daily",
               "sort_col": "date",
               "sort_ord": "DESC",
               "action": "historical_data",
           })

data = pd.read_html(r.text)[0]
print(data)