Reading HTML table with data spread across rows
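I have extracted an HTML table with BeautifulSoup and want to load it into a pandas DataFrame. However, the data in the original table is spread across multiple rows. Here are two entries for reference: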
<table>
<tbody><tr>
<td>Record : 1 of 749</td>
</tr>
<tr>
<td width="111">Patients Name</td>
<td width="4">:</td>
<td colspan="4">Andrew Smith</td>
</tr>
<tr>
<td>Admit Date</td>
<td>:</td>
<td width="189">20-MAR-2018</td>
<td>Group Number </td>
<td>:</td>
<td>17</td>
</tr>
<tr>
<td>Address</td>
<td>:</td>
<td>123 Sunshine Ave </td>
<td>Postal Code </td>
<td>:</td>
<td>12345</td>
</tr>
<tr>
<td>Blood Type</td>
<td>:</td>
<td>A </td>
<td width="96">Ward Class</td>
<td width="4">:</td>
<td width="174">A</td>
</tr>
<tr>
<td>Age</td>
<td>:</td>
<td>45</td>
<td>Height</td>
<td>:</td>
<td>
174cm
</td>
</tr>
<tr>
<td>Weight</td>
<td>:</td>
<td>102kg</td>
<td>ID</td>
<td>:</td>
<td>
013</td>
</tr>
<tr>
<td><hr/></td>
</tr>
<tr>
<td>Record : 2 of 749</td>
</tr>
<tr>
<td width="111">Patients Name</td>
<td width="4">:</td>
<td colspan="4">Margaret Chow</td>
</tr>
<tr>
<td>Admit Date</td>
<td>:</td>
<td width="189">19-MAR-2018</td>
<td>Group Number </td>
<td>:</td>
<td>14</td>
</tr>
<tr>
<td>Address</td>
<td>:</td>
<td>5 Mango Beach </td>
<td>Postal Code </td>
<td>:</td>
<td>54321</td>
</tr>
<tr>
<td>Blood Type</td>
<td>:</td>
<td>B </td>
<td width="96">Ward Class</td>
<td width="4">:</td>
<td width="174">B2</td>
</tr>
<tr>
<td>Age</td>
<td>:</td>
<td>32</td>
<td>Height</td>
<td>:</td>
<td>
154cm
</td>
</tr>
<tr>
<td>Weight</td>
<td>:</td>
<td>52kg</td>
<td>ID</td>
<td>:</td>
<td>
051</td>
</tr>
<tr>
<td><hr/></td>
</tr>
</tbody></table>
I used the following code to load the table above into a pandas DataFrame:
import pandas as pd
table = str(table)
df = pd.read_html(table)
df = pd.DataFrame(df)
df
My df looks like this:
But I want it to be a DataFrame with the columns ['Patients Name', 'Admit Date', 'Group Number', 'Address', 'Postal Code', 'Blood Type', 'Ward Class', 'Age', 'Height', 'Weight', 'ID'].
I'm a beginner. Any advice is much appreciated!
import pandas as pd
from bs4 import BeautifulSoup as bs

soup = bs(table, 'html.parser')
df = pd.DataFrame()  # you can add index and column details at this point too
row_index = -1
for row in soup.find_all('tr'):
    if row.find('td').find('hr'):  # some rows hold only a horizontal line; skip them
        continue
    if len(row.find_all('td')) == 1:  # the "Record : 1 of ..." row starts a new record
        # alternative check: if row.find_all('td')[0].get_text().startswith('Record :'):
        row_index += 1
        continue
    tds = [td.get_text().strip() for td in row.find_all('td')]
    df.at[row_index, tds[0]] = tds[2]  # cells are label, ':', value
    if len(tds) > 3:  # some rows hold a second label/value pair; make this dynamic if a row can hold more than two
        df.at[row_index, tds[3]] = tds[5]
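This was also my first time web scraping, and I'm glad I found a solution! This code works for the problem as you defined it. You may have to change some of the conditions depending on the table structure.
PS: This is my first answer on Stack Overflow, and I really hope it helps :)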
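If a row can hold more than two label/value pairs, one possible variation is to walk the cells three at a time and collect each record in a dict, building the DataFrame once at the end. This is only a sketch under assumptions: the function name `table_to_records` and the shortened `sample` table are made up for illustration, and it assumes every data row follows the label, ':', value pattern shown in the question.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_records(html):
    """Return one dict per 'Record : N of M' block (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 1:
            # Single-cell rows are either the "Record : ..." header
            # or the <hr/> separator, whose text is empty.
            if cells[0].startswith("Record"):
                records.append({})  # start a new record
            continue
        if not records:
            continue  # ignore data rows that appear before any header
        # Walk the row three cells at a time: label, ':' separator, value.
        for i in range(0, len(cells) - 2, 3):
            label, _, value = cells[i:i + 3]
            records[-1][label] = value
    return records


# Shortened copy of the table from the question, for illustration only.
sample = """
<table><tbody>
<tr><td>Record : 1 of 749</td></tr>
<tr><td>Patients Name</td><td>:</td><td colspan="4">Andrew Smith</td></tr>
<tr><td>Age</td><td>:</td><td>45</td><td>Height</td><td>:</td><td>
174cm
</td></tr>
<tr><td>Weight</td><td>:</td><td>102kg</td><td>ID</td><td>:</td><td>
013</td></tr>
<tr><td><hr/></td></tr>
<tr><td>Record : 2 of 749</td></tr>
<tr><td>Patients Name</td><td>:</td><td colspan="4">Margaret Chow</td></tr>
<tr><td>Age</td><td>:</td><td>32</td><td>Height</td><td>:</td><td>
154cm
</td></tr>
<tr><td>Weight</td><td>:</td><td>52kg</td><td>ID</td><td>:</td><td>
051</td></tr>
<tr><td><hr/></td></tr>
</tbody></table>
"""

df = pd.DataFrame(table_to_records(sample))
print(df)
```

Every value stays a string here, which keeps the leading zeros in ID. Building the frame from a list of dicts also avoids growing the DataFrame one cell at a time.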