BeautifulSoup and CSV files
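I want to extract the table from http://www.atpworldtour.com/Rankings/Top-Matchfacts.aspx?y=2015&s=1# and put all of its information into a CSV file.
I've done this, but I've run into a problem. The first column of the table contains both the players' rankings and their names. I want to separate them so that one column contains only the ranking and another contains only the player's name.
Here is the code: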
import urllib2
from bs4 import BeautifulSoup
import csv

URL = 'http://www.atpworldtour.com/Rankings/Top-Matchfacts.aspx?y=2015&s=1#'
req = urllib2.Request(URL)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
tables = soup.findAll('table')
my_table = tables[0]

with open('out2.csv', 'w') as f:
    csvwriter = csv.writer(f)
    for row in my_table.findAll('tr'):
        cells = [c.text.encode('utf-8') for c in row.findAll('td')]
        # Only data rows have 16 cells; this skips header and filler rows.
        if len(cells) == 16:
            csvwriter.writerow(cells)
Here is the output for a few players:
"1
Novak Djokovic",SRB,5-0,0-0,9,1.8,7,1.4,62%,74%,58%,88%,42%,68%,39%-57%,46%
"2
Roger Federer",SUI,1-1,0-1,9,4.5,2,1.0,59%,68%,54%,84%,46%,67%,37%-49%,33%
"3
Andy Murray",GBR,0-0,0-0,0,0.0,0,0.0,0%,0%,0%,0%,0%,0%,0%-0%,0%
"4
Rafael Nadal",ESP,11-3,2-1,25,1.8,18,1.3,68%,69%,57%,82%,43%,57%,36%-58%,38%
"5
Kei Nishikori",JPN,5-0,0-0,14,2.8,9,1.8,57%,75%,62%,92%,49%,80%,39%-62%,42%
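As you can see, the first column isn't written properly: the rank sits on a line above the rest of the data, separated from the name by a large gap.
The HTML for the problem column is slightly more complicated than for the other columns: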
<td class="col1" rel="1">1
<a href="/Tennis/Players/Top-Players/Novak-Djokovic.aspx">Novak Djokovic</a></td>
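I tried to separate the two values using that structure, but I couldn't get it to work, and I thought it might be easier to fix the current CSV file instead.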
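Separating the fields after you've pulled them out is pretty easy. You have a number, a run of whitespace, and a name. So just use split with the default delimiter and a max split of 1: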
cells = [c.text.encode('utf-8') for c in row.findAll('td')]
if len(cells) == 16:
    # Replace the first cell with two cells: rank and name.
    cells[0:1] = cells[0].split(None, 1)
    csvwriter.writerow(cells)
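To see what the slice assignment does in isolation, here is a minimal sketch written for Python 3 rather than the Python 2 used above. The exact whitespace inside `first_cell` is an assumption based on the HTML shown in the question, and the row is shortened to three extra cells for illustration only.

```python
import csv
import io

# Assumed text of the first cell: rank, a line break, indentation, then the name.
first_cell = "1\r\n\t\t\t\tNovak Djokovic"
other_cells = ["SRB", "5-0", "0-0"]  # shortened row, for illustration only

cells = [first_cell] + other_cells
# split(None, 1) splits once on the first run of whitespace, so the name
# keeps its internal space. Assigning to the slice [0:1] swaps one cell for two.
cells[0:1] = cells[0].split(None, 1)

# Write to an in-memory buffer instead of a file so the result is easy to inspect.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(cells)
csv_line = buffer.getvalue()
print(csv_line)
```

Because the split happens on any whitespace, this should keep working if the page switches between tabs, spaces, and line breaks, as long as the rank itself contains no whitespace.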
But you could also separate them while still in the soup, which is probably more robust:
cells = row.find_all('td')
cell0 = cells.pop(0)
rank = next(cell0.children).strip().encode('utf-8')
name = cell0.find('a').text.encode('utf-8')
cells = [rank, name] + [c.text.encode('utf-8') for c in cells]
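The same idea can be tried offline against the HTML fragment from the question. This is only a sketch for Python 3: the surrounding `<table>` and `<tr>` tags, the two extra cells, and the choice of the built-in `html.parser` are assumptions made so the example is self-contained.

```python
from bs4 import BeautifulSoup

# The first <td> is copied from the question; the rest of the row is a
# shortened, assumed stand-in for the real table.
html = """<table><tr>
<td class="col1" rel="1">1
    <a href="/Tennis/Players/Top-Players/Novak-Djokovic.aspx">Novak Djokovic</a></td>
<td>SRB</td><td>5-0</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")

cells = row.find_all("td")
cell0 = cells.pop(0)

# The first child of the cell is the bare text node holding the rank.
rank = next(cell0.children).strip()
# The name lives inside the <a> element.
name = cell0.find("a").get_text(strip=True)

record = [rank, name] + [c.get_text(strip=True) for c in cells]
print(record)
```

Reading the rank and the name from separate nodes avoids depending on how much whitespace sits between them, which is why this route tends to survive small changes in the page's formatting.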
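Since the value you're concerned with contains several tabs and the player's name comes right after the last tab, I'd suggest splitting on tabs and taking the last item of the resulting list.
The line I added is cells[0] = cells[0].split('\t')[-1]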
import urllib2
from bs4 import BeautifulSoup
import csv

URL = 'http://www.atpworldtour.com/Rankings/Top-Matchfacts.aspx?y=2015&s=1#'
req = urllib2.Request(URL)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
tables = soup.findAll('table')
my_table = tables[0]

# The with block closes the file on exit, so no explicit f.close() is needed.
with open('out2.csv', 'w') as f:
    csvwriter = csv.writer(f)
    for row in my_table.findAll('tr'):
        cells = [c.text.encode('utf-8') for c in row.findAll('td')]
        if len(cells) == 16:
            # Keep only the text after the last tab (the player's name).
            cells[0] = cells[0].split('\t')[-1]
            csvwriter.writerow(cells)
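One thing to be aware of: taking only the last item keeps the name and drops the rank. If both are wanted, the first item of the same split still holds the rank. A small Python 3 sketch, assuming the cell text is the rank, a line break, four tabs, and the name (the real page's whitespace may differ):

```python
# Assumed text of the first cell; the exact whitespace is a guess.
first_cell = "1\r\n\t\t\t\tNovak Djokovic"

parts = first_cell.split("\t")
name = parts[-1]         # text after the last tab
rank = parts[0].strip()  # text before the first tab, minus the line break

print(parts)
print(rank, name)
```

Consecutive tabs produce empty strings in the middle of `parts`, which is why only the first and last items are useful here.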