Extract Table Information from HTML (As Text File)
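I am trying to extract information from the tables in an HTML file so that I can use it as text. I can only access the file through a VPN, so I have already downloaded all the HTML files I need.
I specifically want to get the information from multiple tables that share the same table class, but when I try, nothing is returned. I have attached the code I tried to use to get this information, without success.
Below is one of the HTML files I have been trying to extract from. It is quite large, so I hope that will not be a problem.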
Table Information
<table class="region-table">
<thead>
<tr>
<th>Region</th>
<th>Type</th>
<th>From</th>
<th>To</th>
<th colspan="2">Most similar known cluster</th>
<th>Similarity</th>
</tr>
</thead>
<tbody>
<tr class="linked-row odd" data-anchor="#r1c1">
<td class="regbutton NRPS-like r1c1">
<a href="#r1c1">Region 1.1</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps-like" target="_blank">NRPS-like</a>
</td>
<td class="digits">21,469</td>
<td class="digits table-split-left">62,957</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001740/1" target="_blank">phthoxazolin</a></td>
<td>NRP + Polyketide</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 4%, #ffffff00 4%)">4%</td>
</tr>
<tr class="linked-row even" data-anchor="#r1c2">
<td class="regbutton NRPS r1c2">
<a href="#r1c2">Region 1.2</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a>
</td>
<td class="digits">74,163</td>
<td class="digits table-split-left">124,963</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001709/1" target="_blank">nystatin</a></td>
<td>Polyketide</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 10%, #ffffff00 10%)">10%</td>
</tr>
</tbody>
</table>
<table class="region-table">
<thead>
<tr>
<th>Region</th>
<th>Type</th>
<th>From</th>
<th>To</th>
<th colspan="2">Most similar known cluster</th>
<th>Similarity</th>
</tr>
</thead>
<tbody>
<tr class="linked-row odd" data-anchor="#r2c1">
<td class="regbutton terpene r2c1">
<a href="#r2c1">Region 2.1</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#terpene" target="_blank">terpene</a>
</td>
<td class="digits">3,800</td>
<td class="digits table-split-left">23,263</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001580/1" target="_blank">ebelactone</a></td>
<td>Polyketide</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 5%, #ffffff00 5%)">5%</td>
</tr>
<tr class="linked-row even" data-anchor="#r2c2">
<td class="regbutton NRPS-like r2c2">
<a href="#r2c2">Region 2.2</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps-like" target="_blank">NRPS-like</a>
</td>
<td class="digits">55,320</td>
<td class="digits table-split-left">97,088</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0000727/1" target="_blank">indigoidine</a></td>
<td>Saccharide</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 17%, #ffffff00 17%)">17%</td>
</tr>
<tr class="linked-row odd" data-anchor="#r2c3">
<td class="regbutton NRPS r2c3">
<a href="#r2c3">Region 2.3</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a>
</td>
<td class="digits">144,740</td>
<td class="digits table-split-left">193,599</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0000368/1" target="_blank">streptobactin</a></td>
<td>NRP</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(210, 105, 30, 0.3), rgba(210, 105, 30, 0.3) 70%, #ffffff00 70%)">70%</td>
</tr>
<tr class="linked-row even" data-anchor="#r2c4">
<td class="regbutton siderophore r2c4">
<a href="#r2c4">Region 2.4</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#siderophore" target="_blank">siderophore</a>
</td>
<td class="digits">347,862</td>
<td class="digits table-split-left">362,833</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001593/1" target="_blank">ficellomycin</a></td>
<td>NRP</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 3%, #ffffff00 3%)">3%</td>
</tr>
<tr class="linked-row odd" data-anchor="#r2c5">
<td class="regbutton lassopeptide r2c5">
<a href="#r2c5">Region 2.5</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#lassopeptide" target="_blank">lassopeptide</a>
</td>
<td class="digits">548,017</td>
<td class="digits table-split-left">570,561</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001435/1" target="_blank">ikarugamycin</a></td>
<td>NRP + Polyketide:Iterative type I</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 12%, #ffffff00 12%)">12%</td>
</tr>
<tr class="linked-row even" data-anchor="#r2c6">
<td class="regbutton NRPS r2c6">
<a href="#r2c6">Region 2.6</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a>
</td>
<td class="digits">628,834</td>
<td class="digits table-split-left">683,050</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001117/1" target="_blank">himastatin</a></td>
<td>NRP</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 12%, #ffffff00 12%)">12%</td>
</tr>
<tr class="linked-row odd" data-anchor="#r2c7">
<td class="regbutton NRPS,terpene hybrid r2c7">
<a href="#r2c7">Region 2.7</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a>,<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#terpene" target="_blank">terpene</a>
</td>
<td class="digits">1,043,511</td>
<td class="digits table-split-left">1,104,786</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0002024/1" target="_blank">nargenicin</a></td>
<td>Polyketide</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 11%, #ffffff00 11%)">11%</td>
</tr>
</tbody>
</table>
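Code snippet
from bs4 import BeautifulSoup

# `html` holds the contents of the downloaded HTML file
soup = BeautifulSoup(html, "lxml")
# Note: find() returns only the FIRST table with this class
gdp_table = soup.find("table", attrs={"class": "region-table"})
gdp_table_data = gdp_table.tbody.find_all("tr")  # the 2 rows of that first table
# This counts rows of the first table, not tables
print("Extracted {num} Region-Tables".format(num=len(gdp_table_data)))
print(gdp_table_data[0])  # first row (intended: first table)
print(gdp_table_data[1])  # second row (intended: second table)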
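Ideally, I would like to input the HTML file, extract the information from all the different tables, merge it into one big table, and possibly output it as a CSV.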
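The goal described above (collect the body rows of every table with class region-table into one list and write them out as CSV) can be sketched with only the Python standard library. This is a minimal sketch, not the method used in the answer: the class name RegionTableParser and the shortened sample_html are made up for illustration, and it assumes the tables are not nested inside each other.

```python
import csv
import io
from html.parser import HTMLParser


class RegionTableParser(HTMLParser):
    """Collect the text of every <td> in the <tbody> rows of matching tables."""

    def __init__(self, table_class="region-table"):
        super().__init__()
        self.table_class = table_class
        self.rows = []         # one list of cell strings per body row
        self._in_table = False
        self._in_body = False
        self._row = None       # cells of the row being read, or None
        self._cell = None      # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and self.table_class in classes:
            self._in_table = True
        elif tag == "tbody" and self._in_table:
            self._in_body = True
        elif tag == "tr" and self._in_body:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join the text pieces (including text inside <a>) and trim whitespace
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "tbody":
            self._in_body = False
        elif tag == "table":
            self._in_table = False

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


# Shortened, hypothetical stand-in for the downloaded file
sample_html = """
<table class="region-table">
<thead><tr><th>Region</th><th>From</th></tr></thead>
<tbody>
<tr><td><a href="#r1c1">Region 1.1</a></td><td class="digits">21,469</td></tr>
</tbody>
</table>
<table class="region-table">
<tbody>
<tr><td><a href="#r2c1">Region 2.1</a></td><td class="digits">3,800</td></tr>
</tbody>
</table>
"""

parser = RegionTableParser()
parser.feed(sample_html)

# Write to an in-memory buffer here; a real run would open a .csv file instead
buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
print(buffer.getvalue())
```

Header rows are skipped because only rows inside tbody are collected, so the rows of all tables line up as one merged table.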
Read the HTML data from the file and export it to a single CSV.
import csv
from simplified_scrapy import SimplifiedDoc, utils

name = 'test.html'
html = utils.getFileContent(name)  # Get data from file
doc = SimplifiedDoc(html)
rows = []
tables = doc.selects('table.region-table')  # every table with this class
for table in tables:
    trs = table.tbody.trs
    for tr in trs:
        rows.append([td.text for td in tr.tds])  # one list of cell texts per row
# newline='' keeps the csv module from writing blank lines on Windows
with open(name + '.csv', 'w', encoding='utf-8', newline='') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerows(rows)
If you want to keep one file per table:
doc = SimplifiedDoc(html)
tables = doc.selects('table.region-table')
# Number the output files starting from 1
for i, table in enumerate(tables, start=1):
    rows = []
    trs = table.tbody.trs
    for tr in trs:
        rows.append([td.text for td in tr.tds])
    # One CSV per table: test.html1.csv, test.html2.csv, ...
    with open(name + str(i) + '.csv', 'w', encoding='utf-8', newline='') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerows(rows)
The original version, kept for comparison:
import csv
from simplified_scrapy import SimplifiedDoc

html = ''''''  # Your HTML
doc = SimplifiedDoc(html)
rows = []
tables = doc.selects('table.region-table')
for table in tables:
    trs = table.tbody.trs
    for tr in trs:
        rows.append([td.text for td in tr.tds])
# If you have '>Region.*?</a>' in each row, you can get all the rows directly in the following way
# trs = doc.getElementsByReg('>Region.*?</a>',tag='tr')
# for tr in trs:
#     rows.append([td.text for td in tr.tds])
with open('test.csv', 'w', encoding='utf-8', newline='') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerows(rows)
Result:
Region 1.1,NRPS-like,"21,469","62,957",phthoxazolin,NRP + Polyketide,4%
Region 1.2,NRPS,"74,163","124,963",nystatin,Polyketide,10%
Region 2.1,terpene,"3,800","23,263",ebelactone,Polyketide,5%
Region 2.2,NRPS-like,"55,320","97,088",indigoidine,Saccharide,17%
Region 2.3,NRPS,"144,740","193,599",streptobactin,NRP,70%
Region 2.4,siderophore,"347,862","362,833",ficellomycin,NRP,3%
Region 2.5,lassopeptide,"548,017","570,561",ikarugamycin,NRP + Polyketide:Iterative type I,12%
Region 2.6,NRPS,"628,834","683,050",himastatin,NRP,12%
Region 2.7,"NRPS,terpene","1,043,511","1,104,786",nargenicin,Polyketide,11%
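In the result above, the From and To values are quoted because they contain thousands separators, and Similarity keeps its percent sign, so every field is still a string. If numeric values are needed later, a conversion along these lines may help. This is only a sketch: the HEADER names are hypothetical (the thead has six th cells for seven columns because "Most similar known cluster" uses colspan="2"), and it assumes every row has all seven cells filled in as in the sample.

```python
import csv
import io

# Two rows copied from the result above
csv_text = '''Region 1.1,NRPS-like,"21,469","62,957",phthoxazolin,NRP + Polyketide,4%
Region 2.7,"NRPS,terpene","1,043,511","1,104,786",nargenicin,Polyketide,11%
'''

# Hypothetical column names; "Cluster type" is an invented label for the
# second column under the two-column "Most similar known cluster" header
HEADER = ["Region", "Type", "From", "To",
          "Most similar known cluster", "Cluster type", "Similarity"]


def to_record(row):
    """Turn one CSV row into a dict with numeric From, To and Similarity."""
    record = dict(zip(HEADER, row))
    record["From"] = int(record["From"].replace(",", ""))
    record["To"] = int(record["To"].replace(",", ""))
    record["Similarity"] = int(record["Similarity"].rstrip("%"))
    return record


# csv.reader handles the quoted fields, so "NRPS,terpene" stays one value
records = [to_record(row) for row in csv.reader(io.StringIO(csv_text))]
for record in records:
    print(record["Region"], record["To"] - record["From"], record["Similarity"])
```

Rows with an empty Similarity cell would need an extra check before calling int().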