Printing table data scraped and parsed using lxml
I'm trying to print the data in a form similar to a pandas DataFrame, but I don't know how to extract the information from it.
import requests
import lxml.html as lh
response = requests.get('https://www.basketball-reference.com/leagues/NBA_2018_advanced.html')
# print(response.text)
doc = lh.fromstring(response.content)
tr_elements = doc.xpath('//tr')
print(tr_elements[0])
============ Output ============
<Element tr at 0x3f7ccf0>
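It's hard to know exactly what you want to achieve, but the following takes the table data you have and creates a DataFrame whose column names come from the table headers, then fills it with the data, leaving out the repeated table header rows.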
import lxml.html as lh
import requests
import pandas as pd
response = requests.get('https://www.basketball-reference.com/leagues/NBA_2018_advanced.html')
doc = lh.fromstring(response.content)
tr_elements = doc.xpath('//tr')
headers = [header.text for header in tr_elements[0]]  # get the table headers
rows = []
for element in tr_elements[1:]:
    row = [data.text for data in element]  # get the non-nested table elements
    if row == headers:
        continue  # skip the header rows that are repeated inside the table
    if row[1] is None:
        row[1] = element[1][0].text  # get the player name from the hyperlink
    if not isinstance(row[4], str):
        row[4] = element[4][0].text  # get the team name from the hyperlink
    rows.append(row)
df = pd.DataFrame.from_records(rows)
df.columns = headers
print(df)
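If you'd rather not special-case the linked cells by index, one possible variation is to read each cell with lxml's text_content(), which returns the text of an element together with all of its descendants. The sketch below is only an illustration under assumptions: SAMPLE_HTML is a made-up miniature table (not data from the site) and table_rows is a hypothetical helper name, so it runs offline.

```python
import lxml.html as lh

# Hypothetical miniature of the page's table; the values are made up.
SAMPLE_HTML = """
<table>
  <tr><th>Rk</th><th>Player</th><th>Tm</th></tr>
  <tr><td>1</td><td><a href="/players/a.html">Player A</a></td><td><a href="/teams/AAA/">AAA</a></td></tr>
  <tr><th>Rk</th><th>Player</th><th>Tm</th></tr>
  <tr><td>2</td><td><a href="/players/b.html">Player B</a></td><td>TOT</td></tr>
</table>
"""


def table_rows(html):
    """Return (headers, rows), dropping rows that repeat the header."""
    doc = lh.fromstring(html)
    tr_elements = doc.xpath('//tr')
    # text_content() joins the text of a cell and its descendants, so
    # cells with and without a nested <a> are read the same way.
    headers = [cell.text_content().strip() for cell in tr_elements[0]]
    rows = []
    for tr in tr_elements[1:]:
        row = [cell.text_content().strip() for cell in tr]
        if row != headers:  # skip header rows repeated inside the table
            rows.append(row)
    return headers, rows


headers, rows = table_rows(SAMPLE_HTML)
print(headers)
for row in rows:
    print(row)
```

The resulting headers and rows can then be handed to pd.DataFrame(rows, columns=headers) to get the same kind of DataFrame as above. Whether the real page behaves like this small sample is an assumption you would need to check against the actual HTML.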