rvest html_table reads an incomplete table
I'm trying to get a table from the following address:
library(rvest)
library(dplyr)
base <- "https://www.investing.com/equities/penoles-historical-data"
data_df <- (read_html(base) %>% html_table)[[2]]
But it only reads the first 20 rows.
Is there a way to read all of the information?
Thanks in advance.
When you set a date range in the calendar above the table, the page makes an API call with the following request:

POST https://www.investing.com/instruments/HistoricalDataAjax

with 'X-Requested-With: XMLHttpRequest' as a header and a form-url-encoded body containing, among other parameters, the start and end dates. The payload also includes a field curr_id, which can be scraped from the main page by looking for the div that carries a pair_ids attribute.
The following code fetches the data for the given date range into a data frame:
library(rvest)
library(httr)

startDate <- as.Date("2020-06-01")
endDate <- Sys.Date()  # today
userAgent <- "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
mainUrl <- "https://www.investing.com/equities/penoles-historical-data"

s <- html_session(mainUrl)

# curr_id lives in the pair_ids attribute of a div on the main page
pair_ids <- s %>%
  html_nodes("div[pair_ids]") %>%
  html_attr("pair_ids")

# replay the form the calendar widget submits; request_POST is an
# unexported rvest helper, hence the ':::'
resp <- s %>% rvest:::request_POST(
  "https://www.investing.com/instruments/HistoricalDataAjax",
  add_headers('X-Requested-With' = 'XMLHttpRequest'),
  user_agent(userAgent),
  body = list(
    curr_id = pair_ids,
    header = "PEOLES Historical Data",
    st_date = format(startDate, format = "%m/%d/%Y"),
    end_date = format(endDate, format = "%m/%d/%Y"),
    interval_sec = "Daily",
    sort_col = "date",
    sort_ord = "DESC",
    action = "historical_data"
  ),
  encode = "form") %>%
  html_table

print(resp[[1]])
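Note that every column of the returned table comes back as character text: the dates look like "Jun 01, 2020" and the prices carry thousands separators. A minimal cleanup sketch, assuming the column names the page currently serves ("Date", "Price", "Open", "High", "Low") and an English locale for the month abbreviations:

library(dplyr)

df <- resp[[1]]

clean_df <- df %>%
  mutate(
    Date  = as.Date(Date, format = "%b %d, %Y"),  # "Jun 01, 2020" -> Date; needs an English locale for %b
    Price = as.numeric(gsub(",", "", Price)),     # drop thousands separators
    Open  = as.numeric(gsub(",", "", Open)),
    High  = as.numeric(gsub(",", "", High)),
    Low   = as.numeric(gsub(",", "", Low))
  )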
The same code in Python:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import datetime
startDate = datetime.date(2020, 6, 1)
endDate = datetime.date.today()
userAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
s = requests.Session()
# fetch the main page first so the session picks up its cookies
r = s.get("https://www.investing.com/equities/penoles-historical-data",
          headers={"User-Agent": userAgent})
soup = BeautifulSoup(r.text, "html.parser")

# curr_id lives in the pair_ids attribute of a div on the main page
pair_id = soup.find("div", attrs={"pair_ids": True})["pair_ids"]

# replay the form the calendar widget submits
r = s.post("https://www.investing.com/instruments/HistoricalDataAjax",
           headers={
               "X-Requested-With": "XMLHttpRequest",
               "User-Agent": userAgent,
           },
           data={
               "curr_id": pair_id,
               "header": "PEOLES Historical Data",
               "st_date": startDate.strftime("%m/%d/%Y"),
               "end_date": endDate.strftime("%m/%d/%Y"),
               "interval_sec": "Daily",
               "sort_col": "date",
               "sort_ord": "DESC",
               "action": "historical_data",
           })

data = pd.read_html(r.text)[0]
print(data)