How to skip the first header row in a table while scraping
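I want to skip the first header row in the data I am scraping. I am struggling to write the code for this, so any help would be appreciated.
The code I have come up with so far: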
import csv
import urllib.request
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib.request.urlopen("http://tis.nhai.gov.in/TollInformation?TollPlazaID=236").read(),'lxml')
tbody = soup('table' ,{"class":"tollinfotbl"})[0].find_all('tr')
for row in tbody:
    cols = row.findChildren(recursive=False)
    cols = [ele.text.strip() for ele in cols]
This is really clunky and overkill, but here it is:
row_num = 0
for row in tbody:
    if row_num > 0:
        cols = row.findChildren(recursive=False)
        cols = [ele.text.strip() for ele in cols]
    row_num = row_num + 1
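As a shorter alternative to the manual counter, here is a minimal sketch of skipping the first item by slicing. It relies on the fact that `find_all` returns a list-like `ResultSet`, so `tbody[1:]` should work on the question's variable. The helper name `skip_first` and the stand-in strings are assumptions made for illustration only; they are not part of the original code.

```python
from itertools import islice

def skip_first(rows, n=1):
    """Return an iterator over `rows` without its first `n` items (hypothetical helper)."""
    return islice(rows, n, None)

# Stand-in data: in the scraper, `rows` would be the result of find_all('tr').
rows = ["header", "row 1", "row 2"]

# Plain slicing works for lists and list-like results.
sliced = rows[1:]

# islice also works for iterators and generators, which cannot be sliced.
lazy = list(skip_first(iter(rows)))
```

In the question's loop this would read `for row in tbody[1:]:`, which removes the need for `row_num` entirely.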
Give this a run. You won't get the empty brackets anymore.
import urllib.request
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib.request.urlopen("http://tis.nhai.gov.in/TollInformation?TollPlazaID=236").read(),'lxml')
table = soup.find('table', {"class": "tollinfotbl"})
rows = [[ele.text.strip() for ele in item.find_all("td")]
        for item in table.find_all("tr")]
for data in rows:
    print(' '.join(data))
If you prefer, you can also use the requests module:
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("http://tis.nhai.gov.in/TollInformation?TollPlazaID=236").text,'lxml')
titles = soup.select("table.tollinfotbl")[0]
list_row = [[tab_d.text.strip() for tab_d in item.select('td')]
            for item in titles.select('tr')]
for data in list_row:
    print(' '.join(data))
Here is the result:
45.00 70.00 1565.00 25.00
75.00 115.00 2525.00 40.00
160.00 240.00 5290.00 80.00
175.00 260.00 5770.00 85.00
250.00 375.00 8295.00 125.00
250.00 375.00 8295.00 125.00
305.00 455.00 10100.00 150.00
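The question imports `csv` but never uses it, so here is a rough sketch of how the scraped rows might be written out. It assumes the header row contains only `<th>` cells, in which case collecting `td` cells yields an empty list for that row, and filtering out empty lists drops it. The function name `rows_to_csv` and the sample values (copied from the output above) are illustrative assumptions, and an in-memory buffer stands in for a real file.

```python
import csv
import io

def rows_to_csv(rows, out):
    """Write the non-empty rows to `out` as CSV and return them (hypothetical helper)."""
    # A row with no <td> cells arrives as [], which is falsy and gets dropped here.
    kept = [row for row in rows if row]
    csv.writer(out).writerows(kept)
    return kept

# Stand-in for the scraped `rows`; the leading [] represents the header row.
sample_rows = [
    [],
    ["45.00", "70.00", "1565.00", "25.00"],
    ["75.00", "115.00", "2525.00", "40.00"],
]

buffer = io.StringIO()  # swap for open("tolls.csv", "w", newline="") to write a file
kept_rows = rows_to_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
```

Filtering on emptiness rather than on position keeps working if the table has more than one header row, though it would also drop any data row that happens to have no `td` cells.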