Printing table data scraped and parsed using lxml

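I am trying to print the data in a form similar to a pandas DataFrame, but I don't know how to extract the information from the parsed result.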

import lxml.html as lh
import requests

response = requests.get('https://www.basketball-reference.com/leagues/NBA_2018_advanced.html')

# print(response.text)

doc = lh.fromstring(response.content)

tr_elements = doc.xpath('//tr')

print(tr_elements[0])

============ Output ============

<Element tr at 0x3f7ccf0>

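It's hard to know exactly what you want to achieve, but the following takes the table data you have and creates a DataFrame whose column names come from the table headers, then populates it with the data rows while skipping the repeated table header rows.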

import lxml.html as lh
import requests
import pandas as pd

response = requests.get('https://www.basketball-reference.com/leagues/NBA_2018_advanced.html')
doc = lh.fromstring(response.content)
tr_elements = doc.xpath('//tr')
headers = [header.text for header in tr_elements[0]]  # get the table headers
rows = []
for element in tr_elements[1:]:
    row = [data.text for data in element]  # get the non-nested table elements
    # Data rows have no direct text in the player cell (it holds a link);
    # repeated header rows do, so they are skipped.
    if row[1] is None:
        row[1] = element[1][0].text  # get the player name from the hyperlink
        if not isinstance(row[4], str):
            row[4] = element[4][0].text  # get the team name from the hyperlink
        rows.append(row)

df = pd.DataFrame.from_records(rows)
df.columns = headers
print(df)
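
As a variation on the code above, here is a minimal sketch that uses lxml's `text_content()` instead of indexing into the nested link elements. `text_content()` returns the text of an element together with all of its children, so cells wrapped in hyperlinks need no special-casing. The `SAMPLE_HTML` string and the `table_rows` helper are hypothetical, made up for illustration so the example runs without a network request. The real page's layout may differ.

```python
import lxml.html as lh

# Hypothetical miniature table (not real data): a header row, data rows whose
# player/team cells wrap their text in links, and a repeated header row.
SAMPLE_HTML = """
<table>
  <tr><th>Rk</th><th>Player</th><th>Pos</th><th>Age</th><th>Tm</th></tr>
  <tr><th>1</th><td><a href="/p/a">Player A</a></td><td>C</td><td>24</td><td><a href="/t/x">AAA</a></td></tr>
  <tr><th>Rk</th><th>Player</th><th>Pos</th><th>Age</th><th>Tm</th></tr>
  <tr><th>2</th><td><a href="/p/b">Player B</a></td><td>PG</td><td>30</td><td>TOT</td></tr>
</table>
"""


def table_rows(html):
    """Return (headers, rows) for the table rows found in the given HTML."""
    doc = lh.fromstring(html)
    tr_elements = doc.xpath('//tr')
    # text_content() includes text nested inside child elements such as <a>.
    headers = [cell.text_content().strip() for cell in tr_elements[0]]
    rows = []
    for tr in tr_elements[1:]:
        row = [cell.text_content().strip() for cell in tr]
        if row == headers:
            continue  # skip header rows repeated inside the table body
        rows.append(row)
    return headers, rows


headers, rows = table_rows(SAMPLE_HTML)
print(headers)
for row in rows:
    print(row)
```

The resulting `headers` and `rows` can be passed to `pd.DataFrame(rows, columns=headers)` to get the same kind of DataFrame as in the code above.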