How to get different tables from one page using Python (lxml, html, requests, xpath)?
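I am trying to get the Premier League table data from https://www.premierleague.com/tables. I can get the data with the code below, but unfortunately it only works for the latest season option (2018/2019). The page also offers tables for other seasons (2017/2018, ...). How can I scrape the other tables?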
from lxml import html
import requests
# Fetch the page and parse it into an element tree.
page = requests.get('https://www.premierleague.com/tables')
tree = html.fromstring(page.content)
# Keep the first 20 rows that carry a team name attribute.
team_rows = tree.xpath('//table//tbody//tr[@data-filtered-table-row-name]')[0:20]
team_names = [i.attrib['data-filtered-table-row-name'] for i in team_rows]
# Map each team name to the child cells of its row.
teams = {}
for i in range(20):
    element = team_rows[i]
    teams[team_names[i]] = element.getchildren()
# Print the team name followed by the selected columns.
for i in team_names:
    values = [j.text_content() for j in teams[i]]
    row = "{} "*9
    print(row.format(i, *values[3:12]))
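Regarding "but unfortunately it only works for the latest season option (2018/2019)":
The website uses JavaScript to load the older tables (1992-2017), so when you request the page with Python you only get the latest table. If you want to scrape tables filtered by year/season, here is a hard-coded version (hard-coded because I could not find a rule behind the season numbers). If you want to do it more elegantly, selenium or requests_html may suit you.
Note: I imitate the JavaScript request to fetch the data from the server, so the response content is JSON. It can only get the Premier League table for different years. Filtering by competition/matchweek/home_or_away is not available in my example. If you want to add these options to the script, you should work out the rules of the URL parameters (in the way @pguardiario described, or with a tool such as Fiddler).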
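Because the season numbers are hard-coded, a lookup for a year that is not in the mapping is an easy mistake to miss: dict.get() quietly returns None. A minimal sketch of a stricter lookup follows. The names SEASON_IDS and season_id are hypothetical, the values are copied from the answer's mapping, and the keys are assumed to be the season's start year (2018 for 2018/2019).

```python
# Year -> season id, copied from the hard-coded mapping in the answer.
SEASON_IDS = {str(1991 + i): str(i) for i in range(1, 23)}  # "1992".."2013" -> "1".."22"
SEASON_IDS.update({"2014": "27", "2015": "42", "2016": "54", "2017": "79", "2018": "210"})


def season_id(start_year):
    """Return the season id for a year (hypothetical helper).

    Raises KeyError for unknown years instead of returning None,
    which is what dict.get() would do.
    """
    key = str(start_year)
    if key not in SEASON_IDS:
        raise KeyError("no known season id for year " + key)
    return SEASON_IDS[key]


print(season_id(2017))    # prints 79
print(season_id("1992"))  # prints 1
```

A year outside the mapping, for example season_id(2019), raises KeyError, so a missing id is caught before any request is sent.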
import requests
from pprint import pprint
# Year -> season id expected by the server (hard-coded).
years = {str(1991+i): str(i) for i in range(1, 23)}
years.update({
    "2018": "210",
    "2017": "79",
    "2016": "54",
    "2015": "42",
    "2014": "27"
})
specific = years.get("2017")
# Parameters that imitate the request made by the page's JavaScript.
param = {
    "altIds": "true",
    "compSeasons": specific,
    "detail": 2,
    "FOOTBALL_COMPETITION": 1
}
headers = {
    "Origin": "https://www.premierleague.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
    "Referer": "https://www.premierleague.com/tables?co=1&se={}&ha=-1".format(specific),
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
}
page = requests.get('https://footballapi.pulselive.com/football/standings',
                    params=param,
                    headers=headers)
print(page.url)
# The response is JSON, not HTML.
pprint(page.json())
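Once the JSON arrives, it usually needs flattening into rows. The sketch below shows one way to do that. The key names ("tables", "entries", "team", "name", "position", "overall", "played", "points") are an assumption about the response shape, not something the answer states, so check them against the real pprint() output first. standings_rows is a hypothetical helper and the sample data is made up.

```python
def standings_rows(payload):
    """Flatten a standings payload into (position, team, played, points) tuples.

    ASSUMPTION: the payload looks like
    {"tables": [{"entries": [{"position": ..., "team": {"name": ...},
                              "overall": {"played": ..., "points": ...}}]}]}
    Missing keys yield None instead of raising.
    """
    rows = []
    for table in payload.get("tables", []):
        for entry in table.get("entries", []):
            overall = entry.get("overall", {})
            rows.append((
                entry.get("position"),
                entry.get("team", {}).get("name"),
                overall.get("played"),
                overall.get("points"),
            ))
    return rows


# Made-up sample in the assumed shape (not real standings).
sample = {
    "tables": [{
        "entries": [
            {"position": 1, "team": {"name": "Team A"}, "overall": {"played": 38, "points": 90}},
            {"position": 2, "team": {"name": "Team B"}, "overall": {"played": 38, "points": 85}},
        ]
    }]
}

for row in standings_rows(sample):
    print(row)
```

With a real response, standings_rows(page.json()) would take the place of the sample, provided the assumed keys match.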
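Regarding the title, "How to get different tables from one page":
I think your question title and your description ask for different things. If the title is what you meant, the other problem is that you merge all the tables into one. Pay attention to the leading // in your XPath; see "What is meaning of .// in XPath?".
Note: if you want the older Premier League tables, use the code from the first part of my answer, because that data can only be obtained from there.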
from lxml import html
import requests
from pprint import pprint
# Year -> season id expected by the server (hard-coded).
years = {str(1991+i): str(i) for i in range(1, 23)}
years.update({
    "2018": "210",
    "2017": "79",
    "2016": "54",
    "2015": "42",
    "2014": "27"
})
param = {
    "co": "1",
    "se": years.get("2017"),
    "ha": "-1"
}
page = requests.get('https://www.premierleague.com/tables', params=param)
tree = html.fromstring(page.content)
# Select each table body separately instead of all rows at once.
tables = tree.xpath('//tbody[contains(@class,"tableBodyContainer")]')
# A relative path (no leading //) only matches rows inside the current table.
each_table_team_rows = [table.xpath('tr[@data-filtered-table-row-name]') for table in tables]
team_names = [[i.attrib['data-filtered-table-row-name'] for i in team_rows]
              for team_rows in each_table_team_rows]
pprint(team_names)
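To see why a leading // merges every table, here is a small offline sketch comparing // with .// on a made-up two-table document. The HTML and team names are invented for illustration; only the class name and attribute name are taken from the code above.

```python
from lxml import html

# Tiny stand-in document with two tables (made-up content).
doc = html.fromstring("""
<div>
  <table><tbody class="tableBodyContainer">
    <tr data-filtered-table-row-name="Team A"><td>1</td></tr>
    <tr data-filtered-table-row-name="Team B"><td>2</td></tr>
  </tbody></table>
  <table><tbody class="tableBodyContainer">
    <tr data-filtered-table-row-name="Team C"><td>1</td></tr>
  </tbody></table>
</div>
""")

tables = doc.xpath('//tbody[contains(@class,"tableBodyContainer")]')
first = tables[0]

# '//' starts at the document root even when called on a sub-element,
# so this returns the rows of every table.
all_rows = first.xpath('//tr[@data-filtered-table-row-name]')

# './/' starts at the context node, so only this table's rows match.
own_rows = first.xpath('.//tr[@data-filtered-table-row-name]')

print(len(all_rows), len(own_rows))  # prints 3 2
```

The relative path tr[...] used in the answer behaves like .// here, except that it only matches direct children of the tbody.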