Python: parsing tables with BeautifulSoup
HTML page structure:
<table>
  <tbody>
    <tr>
      <th>Timestamp</th>
      <th>Call</th>
      <th>MHz</th>
      <th>SNR</th>
      <th>Drift</th>
      <th>Grid</th>
      <th>Pwr</th>
      <th>Reporter</th>
      <th>RGrid</th>
      <th>km</th>
      <th>az</th>
    </tr>
    <tr>
      <td align="right"> 2019-12-10 14:02 </td>
      <td align="left"> DL1DUZ </td>
      <td align="right"> 10.140271 </td>
      <td align="right"> -26 </td>
      <td align="right"> 0 </td>
      <td align="left"> JO61tb </td>
      <td align="right"> 0.2 </td>
      <td align="left"> F4DWV </td>
      <td align="left"> IN98bc </td>
      <td align="right"> 1162 </td>
      <td align="right"> 260 </td>
    </tr>
    <tr>
      <td align="right"> 2019-10-10 14:02 </td>
      <td align="left"> DL23UH </td>
      <td align="right"> 11.0021 </td>
      <td align="right"> -20 </td>
      <td align="right"> 0 </td>
      <td align="left"> JO61tb </td>
      <td align="right"> 0.2 </td>
      <td align="left"> F4DWV </td>
      <td align="left"> IN98bc </td>
      <td align="right"> 1162 </td>
      <td align="right"> 260 </td>
    </tr>
  </tbody>
</table>
…and so on, more tr/td rows like these.
My code:
from bs4 import BeautifulSoup as bs
import requests
import csv

base_url = 'some_url'
session = requests.Session()
request = session.get(base_url)
val_th = []
val_td = []

if request.status_code == 200:
    soup = bs(request.content, 'html.parser')
    table = soup.findChildren('table')
    tr = soup.findChildren('tr')
    my_table = table[0]
    my_tr_th = tr[0]  # header row
    my_tr_td = tr[1]  # only the FIRST data row
    rows = my_table.findChildren('tr')
    row_th = my_tr_th.findChildren('th')
    row_td = my_tr_td.findChildren('td')
    for r_th in row_th:
        heading = r_th.text
        val_th.append(heading)
    for r_td in row_td:
        data = r_td.text
        val_td.append(data)
    with open('output.csv', 'w') as f:
        a_pen = csv.writer(f)
        a_pen.writerow(val_th)
        a_pen.writerow(val_td)
1) I only get one row of <td> data. How do I make sure that every <td> row on the page ends up in the CSV?
2) There are many <td> tags on the page.
3) Changing my_tr_td = tr[1] to my_tr_td = tr[1:50] is wrong (a slice returns a list of rows, not a single row).
How do I write the data from all the tr/td rows to the CSV file?
Thanks in advance.
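The immediate problem: my_tr_td = tr[1] selects only the first data row, so val_td only ever holds one row of <td> text. The usual fix is to loop over every <tr> in the table and write each row as it is read. A minimal sketch of that fix, reusing the question's placeholder base_url and assuming the table layout shown above:

from bs4 import BeautifulSoup as bs
import requests
import csv

base_url = 'some_url'  # placeholder from the question
response = requests.get(base_url)

if response.status_code == 200:
    soup = bs(response.content, 'html.parser')
    rows = soup.find('table').find_all('tr')
    # Row 0 holds the <th> headers
    headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]
    with open('output.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        # Every remaining row holds <td> cells; write each one as a CSV row
        for row in rows[1:]:
            cells = [td.get_text(strip=True) for td in row.find_all('td')]
            if cells:  # skip rows that contain no <td> at all
                writer.writerow(cells)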
Let's try it this way:
import lxml.html
import csv
import requests

url = "http://wsprnet.org/drupal/wsprnet/spots"
res = requests.get(url)
doc = lxml.html.fromstring(res.text)

cols = []
# First, extract the column headers, stuck all the way at the top,
# with the first one in a particular location and format
cols.append(doc.xpath('//table/tr/node()/text()')[0])
for item in doc.xpath('//table/tr/th'):
    if item.getnext() is not None:
        cols.append(item.getnext().text)

# Now for the actual data
inf = []
for item in doc.xpath('//table//tr//td'):
    inf.append(item.text.replace('\xa0', '').strip())  # strip non-breaking spaces and padding

# Split the flat list of cells into rows, one per len(cols) cells
rows = [inf[x:x + len(cols)] for x in range(0, len(inf), len(cols))]

# Finally, write to file
with open("output.csv", "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(cols)
    for l in rows:
        writer.writerow(l)
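Two caveats about this approach: the chunking expression rows = [inf[x:x+len(cols)] ...] assumes every data row has exactly len(cols) <td> cells, so a row with a missing or extra cell shifts all the data that follows it; and newline='' in open() is what keeps csv.writer from emitting blank lines between rows on Windows.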