Extract specific column from HTML

Question

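import pandas as pd

url = 'https://www.sec.gov/Archives/edgar/data/1383094/000095013120003579/d33910dex991.htm'

# pd.read_html returns a list of DataFrames, one per <table> on the page
df = pd.read_html(url, parse_dates=[0])
df1 = df[0]
df2 = df[1]
df3 = df[2]
df4 = df[3]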

This is my code, and each table comes out looking like this:

0   1   2   3   4   5   6   7   8   9   ... 35  36  37  38  39  40  41  42  43  44
0   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1   I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   ... NaN NaN NaN NaN {1} NaN NaN NaN 205713029.83    NaN
4   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
87  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
88  Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
89  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
90  NaN NaN Class   Class   NaN NaN BeginningNote Balance   BeginningNote Balance   NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
91  {46}    NaN NaN Class C NaN NaN NaN    NaN NaN ... {46}    NaN NaN NaN    NaN NaN NaN NaN NaN

However, my project needs to extract specific columns:

Defaulted Receivables: Line 4
Ending Tranche Balance (all tranches): Line 19
Regular Principal Collections: Line 22
Recoveries: Line 23
Prepayments: Line 24
Interest Collections: Line 25 + Line 26 + Line 27
Ending Reserve Account Balance: Line 63
Ending Pool Balance: Line 79
60 Day Delinquencies: Line 84
90 Day Delinquencies: Line 85
90+ Day Delinquencies: Line 86 + Line 87

So how can I get specific columns from df? Or how can I make my df more readable?

Answer 1

Three options come to mind:

DataFrame.dropna()

df[1].dropna(axis=0,how='all')

This drops every row in which all elements are NaN.
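As a minimal sketch of that clean-up, the snippet below uses a small hand-made frame in place of the real table. The name `table` and its contents are assumptions for illustration only. It drops the all-NaN rows first, then applies the same idea to all-NaN columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one parsed table (an assumption, not the real SEC data):
# blank spacer rows and one completely empty column.
table = pd.DataFrame(
    [
        [np.nan, np.nan, np.nan],
        ["{4} Defaulted Receivables", np.nan, "1,310,326.05"],
        [np.nan, np.nan, np.nan],
        ["{63} End of period Reserve Account balance", np.nan, "12,240,151.27"],
    ]
)

# Drop rows that are entirely NaN, then columns that are entirely NaN.
cleaned = table.dropna(axis=0, how="all").dropna(axis=1, how="all")
print(cleaned)
```

The original row and column labels are kept, so `cleaned` still has index 1 and 3 and columns 0 and 2. Call `reset_index(drop=True)` if consecutive positions are easier to work with.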

Indexing and iloc

i = [1,3,5]
df[1].iloc[i]

If I know the positions of the rows I need, I can pull them out with iloc.

pd.isnull and loc

df[1].loc[~pd.isnull(df[1][0])]

This selects only the rows whose value in column 0 is not NaN. Similarly, loc can be used to match a specific string in a column.
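The string-matching variant could look roughly like the sketch below, which finds a report line by its `{n}` marker. The frame `table` and the helper `rows_for_line` are hypothetical names invented for this example. The match uses `regex=False` because braces have a special meaning in regular expressions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one parsed table (an assumption, not the real SEC data).
table = pd.DataFrame(
    {
        0: [np.nan, "{4} Defaulted Receivables", np.nan,
            "{63} End of period Reserve Account balance"],
        1: [np.nan, "1,310,326.05", np.nan, "12,240,151.27"],
    }
)


def rows_for_line(frame, n):
    """Return the rows whose column 0 contains the literal marker '{n}'."""
    marker = "{" + str(n) + "}"
    # na=False makes NaN cells count as "no match" instead of propagating NaN.
    mask = frame[0].str.contains(marker, regex=False, na=False)
    return frame.loc[mask]


line_4 = rows_for_line(table, 4)
print(line_4)
```

Because the closing brace is part of the marker, `{4}` does not accidentally match `{46}`.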

Answer 2

You can try this example to extract the specified rows from the HTML:

import requests
from bs4 import BeautifulSoup


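def get_row(soup, n):
    # Collect the non-empty cell texts of every row that contains the marker "{n}".
    # Recent soupsieve releases deprecate ":contains" in favor of ":-soup-contains".
    selector = 'tr:-soup-contains("{' + str(n) + '}") td'
    cells = (td.get_text(strip=True) for td in soup.select(selector))
    return [text for text in cells if text]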

url = 'https://www.sec.gov/Archives/edgar/data/1383094/000095013120003579/d33910dex991.htm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

row_numbers = [4, 19, 22, 23, 24, 25, 26, 27, 63, 79, 84, 85, 86, 87]

for n in row_numbers:
    print(get_row(soup, n))

Prints:

['{4}Defaulted Receivables', '{4}', '1,310,326.05']
['{19}End of period Note Balance', '{19}', '—', '—', '—', '—', '—', '103,359,894.20', '48,960,000.00', '152,319,894.20']
['{22}Principal Payments Received', '{22}', '8,508,993.67']
['{23}Liquidation Proceeds', '{23}', '1,417,885.33']
['{24}Principal on Repurchased Receivables', '{24}', '136,546.52']
['{25}Interest on Repurchased Receivables', '{25}', '7,927.83']
['{26}Interest collected on Receivables', '{26}', '2,584,253.82']
['{27}Other amounts received', '{27}', '27,116.85']
['{63}End of period Reserve Account balance', '{63}', '12,240,151.27']
['{79}Principal Balance of the Receivables', '{79}', '1,224,015,127.29', '205,713,029.83', '195,904,816.03']
['{84}31-60days', '{84}', '1,059', '12,688,115.93', '6.48', '%']
['{85}61-90days', '{85}', '397', '4,772,733.21', '2.44', '%']
['{86}91-120days', '{86}', '142', '1,628,631.34', '0.83', '%']
['{87}121 + days delinquent', '{87}', '—', '—', '0.00', '%']
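To turn rows like these into numbers, one possible follow-up is sketched below for the "Interest Collections: Line 25 + Line 26 + Line 27" item from the question. The three rows are copied from the printed output. The helper `last_amount` is a hypothetical name, and treating the last cell as the amount is an assumption that holds for these three rows only. Rows ending in `%` or containing `—` would need their own handling.

```python
from decimal import Decimal

# Rows copied from the printed output above.
rows = [
    ['{25}Interest on Repurchased Receivables', '{25}', '7,927.83'],
    ['{26}Interest collected on Receivables', '{26}', '2,584,253.82'],
    ['{27}Other amounts received', '{27}', '27,116.85'],
]


def last_amount(row):
    """Parse the last cell of a row as a Decimal, dropping thousands separators."""
    return Decimal(row[-1].replace(',', ''))


# Key each amount by its line number, taken from the "{n}" cell.
amounts = {int(row[1].strip('{}')): last_amount(row) for row in rows}

interest_collections = amounts[25] + amounts[26] + amounts[27]
print(interest_collections)  # 2619298.50
```

Decimal is used here rather than float so that the cent values add up exactly.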


html

python

extract

web-scraping