Beautiful Soup - how to clean extracted data?
My question is really trivial, but as a Python beginner I still can't find the answer.
I use the following code to extract some data from the web:
from bs4 import BeautifulSoup
import urllib2
teams = ("http://walterfootball.com/fantasycheatsheet/2015/traditional")
page = urllib2.urlopen(teams)
soup = BeautifulSoup(page, "html.parser")
f = open('output.txt', 'w')
nfl = soup.findAll('li', "player")
lines = [span.get_text(strip=True) for span in nfl]
lines = str(lines)
f.write(lines)
f.close()
But the output is a mess.
Is there an elegant way to get a result like this?
1. Eddie Lacy, RB, Green Bay Packers. Bye: 7
2. LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11
3. Marshawn Lynch, RB, Seattle Seahawks. Bye: 9
...
Just use str.join on the list and strip the trailing + off with .rstrip("+"):
nfl = soup.findAll('li', "player")
# Build a list (not a generator) so lines can be iterated again when writing the file.
lines = ["{}. {}\n".format(ind, span.get_text(strip=True).rstrip("+"))
         for ind, span in enumerate(nfl, 1)]
print("".join(lines))
Which will give you:
1. Eddie Lacy, RB, Green Bay Packers. Bye: 7
2. LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11
3. Marshawn Lynch, RB, Seattle Seahawks. Bye: 9
4. Adrian Peterson, RB, Minnesota Vikings. Bye: 5
5. Jamaal Charles, RB, Kansas City Chiefs. Bye: 9
..................
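To try the numbering and stripping step without hitting the site, here is a minimal sketch. The strings in raw are hypothetical stand-ins for what span.get_text(strip=True) might return, and number_lines is just an illustrative helper name:

```python
# Hypothetical sample strings standing in for span.get_text(strip=True) results.
raw = [
    "Eddie Lacy, RB, Green Bay Packers. Bye: 7+",
    "LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11+",
    "Marshawn Lynch, RB, Seattle Seahawks. Bye: 9",
]


def number_lines(texts):
    """Strip trailing '+' characters and prefix each entry with its 1-based index."""
    return ["{}. {}\n".format(i, t.rstrip("+")) for i, t in enumerate(texts, 1)]


numbered = number_lines(raw)
# Each entry already ends with a newline, so join with an empty separator.
print("".join(numbered))
```

Note that rstrip("+") only removes + characters at the very end of the string, so entries without one pass through unchanged.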
To separate the price, we can either split or use re.sub to add a space before the dollar sign, then write each line:
import re
with open('output.txt', 'w') as f:
    for line in lines:
        # Escape the literal $ and re-insert the captured price after a space.
        line = re.sub(r"(\$\d+)$", r" \1", line, 1)
        f.write(line)
Now the output is:
1. Eddie Lacy, RB, Green Bay Packers. Bye: 7
2. LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11
3. Marshawn Lynch, RB, Seattle Seahawks. Bye: 9
4. Adrian Peterson, RB, Minnesota Vikings. Bye: 5
5. Jamaal Charles, RB, Kansas City Chiefs. Bye: 9
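If you want to check the substitution on its own, something like the following should do. The player names and the $60 price are made up for illustration, and space_before_price is a hypothetical helper name:

```python
import re


def space_before_price(line):
    """Insert a space before a trailing dollar amount, if there is one."""
    # \$ matches a literal dollar sign; the final $ anchors at the end of the
    # line (it also matches just before a trailing newline).
    # \1 puts the captured price back after the added space.
    return re.sub(r"(\$\d+)$", r" \1", line, count=1)


with_price = space_before_price("1. Some Player, RB, Some Team. Bye: 7$60")
without_price = space_before_price("2. Another Player, WR, Other Team. Bye: 9")
print(with_price)
print(without_price)
```

Lines that do not end in a dollar amount are returned unchanged, so it is safe to run over every line.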
You can do the same with str.rsplit, splitting once on $ and rejoining with a space:
with open('output.txt', 'w') as f:
    for line in lines:
        line, p = line.rsplit("$", 1)
        f.write("{} ${}".format(line, p))
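One thing to watch with the rsplit version: unpacking into two names raises ValueError if a line happens to contain no $. A small sketch that guards against that, assuming you would rather keep such lines as they are (split_price is a hypothetical helper name and the sample strings are made up):

```python
def split_price(line):
    """Put a space before the last '$' in line; leave lines without '$' alone."""
    parts = line.rsplit("$", 1)  # split at most once, starting from the right
    if len(parts) == 2:
        return "{} ${}".format(parts[0], parts[1])
    return line


priced = split_price("1. Some Player, RB, Some Team. Bye: 7$60\n")
unpriced = split_price("2. Another Player, WR, Other Team. Bye: 9\n")
print(priced, end="")
print(unpriced, end="")
```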
Iterate over the list lines and write each line:
for num, line in enumerate(lines, 1):
    f.write('{}. {}\n'.format(num, line))
enumerate is used to get (num, line) pairs.
By the way, you'd better use a with statement instead of closing the file object manually:
with open('output.txt', 'w') as f:
    for num, line in enumerate(lines, 1):
        f.write('{}. {}\n'.format(num, line))
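For a quick way to see what ends up in the file, here is a self-contained sketch that writes to a temporary directory and reads the result back. The sample entries and the write_numbered name are placeholders, not part of the original code:

```python
import os
import tempfile


def write_numbered(lines, path):
    """Write each entry of lines to path, prefixed with its 1-based index."""
    # The with block closes the file automatically, even if a write fails.
    with open(path, "w") as f:
        for num, line in enumerate(lines, 1):
            f.write("{}. {}\n".format(num, line))


sample = ["first entry", "second entry"]  # placeholder data
path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_numbered(sample, path)

# Read the file back to check the result.
with open(path) as f:
    content = f.read()
print(content)
```

This assumes lines is still a list of strings, i.e. the lines = str(lines) step from the question has been dropped.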