rvest not able to grab html table using html_nodes("table"), despite table being on page
We are struggling to grab the main table at this fangraphs link. Using rvest:
library(rvest)
url = 'https://www.fangraphs.com/leaders/splits-leaderboards?splitArr=1&splitArrPitch=&position=B&autoPt=false&splitTeams=false&statType=team&statgroup=2&startDate=2021-07-07&endDate=2021-07-21&players=&filter=&groupBy=season&sort=9,1'
# Collect every <table> element present in the downloaded HTML
table_nodes = url %>% read_html() %>% html_nodes('table')
table_nodes
{xml_nodeset (7)}
[1] <table class="menu-standings-table"><tbody><tr>\n<td>\r\n <div class="menu-sub-header">AL East</div>\r\n ...
[2] <table class="menu-team-table">\n<tr>\n<td>\r\n <div class="menu-sub-header">AL East</div>\r\n ...
[3] <table class="menu-team-table">\n<tr>\n<td>\r\n <div class="menu-sub-header">AL East</div>\r\n ...
[4] <table>\n<tr>\n<td><a href="http://www.fangraphs.com/blogs/top-45-prospects-baltimore-orioles">BAL</a></td>\n<td><a href="http://www.fangraphs.com/blogs/top-34-prospects ...
[5] <table>\n<tr>\n<td><a href="http://www.fangraphs.com/blogs/top-30-prospects-atlanta-braves">ATL</a></td>\n<td><a href="http://www.fangraphs.com/blogs/top-49-prospects-ch ...
[6] <table>\n<tr>\n<td><a href="http://www.fangraphs.com/blogs/top-40-prospects-baltimore-orioles">BAL</a></td>\n<td><a href="http://www.fangraphs.com/blogs/top-38-prospects ...
[7] <table>\n<tr>\n<td><a href="http://www.fangraphs.com/blogs/top-27-prospects-atlanta-braves">ATL</a></td>\n<td><a href="http://www.fangraphs.com/blogs/top-41-prospects-ch ...
None of these 7 tables is the main table at the URL with all of the different team stats. url %>% read_html() %>% html_nodes('div.table-scroll') returns an empty node set, even though div.table-scroll is the wrapper div that contains the main table.
Edit: I guess this is a network request, but I am still unsure how to get the API call from it. How can I see the full API call?
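The data is retrieved dynamically from an API call, which is why the main table is missing from the HTML that read_html() downloads. Switch to httr, as you need to make a POST request that includes the start/end dates. Also aim to return as much data as possible in as few calls as possible.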
You would want to convert the code below into some form of custom function that accepts date arguments.
library(httr)
library(purrr)

# Request headers; the body is sent as JSON
headers = c(
  'user-agent' = 'Mozilla/5.0',
  'content-type' = 'application/json;charset=UTF-8'
)
# JSON payload; strStartDate and strEndDate carry the date range
data = '{"strPlayerId":"all","strSplitArr":[1],"strGroup":"season","strPosition":"B","strType":"2","strStartDate":"2021-07-07","strEndDate":"2021-07-21","strSplitTeams":false,"dctFilters":[],"strStatType":"team","strAutoPt":"false","arrPlayerId":[],"strSplitArrPitch":[]}'
# POST the payload and parse the response into an R list
r <- httr::POST(url = 'https://www.fangraphs.com/api/leaders/splits/splits-leaders', httr::add_headers(.headers=headers), body = data) %>% content()
# Bind the returned records into a single data frame
df <- map_df(r$data, data.frame)
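
One possible shape for that custom function is sketched below. The names build_splits_payload and get_splits_leaders are hypothetical and not part of httr or the fangraphs API; the payload fields are copied unchanged from the request above, and the sketch assumes that only the two dates need to vary and that the endpoint keeps responding with a data element as it does above.

```r
# Hypothetical helper: builds the JSON body for a given date range.
# Field names are copied from the request above; only the dates vary.
build_splits_payload <- function(start_date, end_date) {
  # Normalise to yyyy-mm-dd; as.Date() errors on unparseable input
  start_date <- format(as.Date(start_date), "%Y-%m-%d")
  end_date <- format(as.Date(end_date), "%Y-%m-%d")
  sprintf(
    '{"strPlayerId":"all","strSplitArr":[1],"strGroup":"season","strPosition":"B","strType":"2","strStartDate":"%s","strEndDate":"%s","strSplitTeams":false,"dctFilters":[],"strStatType":"team","strAutoPt":"false","arrPlayerId":[],"strSplitArrPitch":[]}',
    start_date, end_date
  )
}

# Hypothetical wrapper: POSTs the payload and returns one data frame.
# Assumes the response still contains a `data` element, as above.
get_splits_leaders <- function(start_date, end_date) {
  headers <- c(
    'user-agent' = 'Mozilla/5.0',
    'content-type' = 'application/json;charset=UTF-8'
  )
  resp <- httr::POST(
    url = 'https://www.fangraphs.com/api/leaders/splits/splits-leaders',
    httr::add_headers(.headers = headers),
    body = build_splits_payload(start_date, end_date)
  )
  httr::stop_for_status(resp)  # stop with an error on an HTTP failure
  purrr::map_df(httr::content(resp)$data, data.frame)
}
```

It would then be called as df <- get_splits_leaders("2021-07-07", "2021-07-21"). Keeping the payload builder separate from the request makes it possible to check the JSON body without touching the network.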