Beautiful Soup - how to clean up extracted data?

My question is really trivial, but as a Python beginner I still can't find the answer.

I'm using the following code to extract some data from a web page:

from bs4 import BeautifulSoup
import urllib2

teams = ("http://walterfootball.com/fantasycheatsheet/2015/traditional")
page = urllib2.urlopen(teams)
soup = BeautifulSoup(page, "html.parser")

f = open('output.txt', 'w')

nfl = soup.findAll('li', "player")
lines = [span.get_text(strip=True) for span in nfl]

lines = str(lines)
f.write(lines)
f.close()

But the output is a mess.

Is there an elegant way to get a result like this?

1. Eddie Lacy, RB, Green Bay Packers. Bye: 7 
2. LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11 
3. Marshawn Lynch, RB, Seattle Seahawks. Bye: 9 
...

Just number the items with enumerate, strip the trailing + with .rstrip("+"), and combine them with str.join:

nfl = soup.findAll('li', "player")
lines = ("{}. {}\n".format(ind,span.get_text(strip=True).rstrip("+"))
         for ind, span in enumerate(nfl,1))
print("".join(lines))

Which will give you:

1. Eddie Lacy, RB, Green Bay Packers. Bye: 7
2. LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11
3. Marshawn Lynch, RB, Seattle Seahawks. Bye: 9
4. Adrian Peterson, RB, Minnesota Vikings. Bye: 5
5. Jamaal Charles, RB, Kansas City Chiefs. Bye: 9
..................
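One thing to watch for when running these snippets back to back: lines above is a generator expression, so joining it for printing consumes it, and a later loop over it would see nothing. A minimal sketch of the difference, using made-up strings in place of the scraped text (the texts list and the number_lines helper are hypothetical, not part of the original code):

```python
# Hypothetical stand-ins for span.get_text(strip=True) on each <li class="player">.
texts = [
    "Eddie Lacy, RB, Green Bay Packers. Bye: 7+",
    "LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11",
]


def number_lines(texts):
    """Return a list of numbered lines with any trailing '+' removed."""
    return ["{}. {}\n".format(ind, text.rstrip("+"))
            for ind, text in enumerate(texts, 1)]


# A list can be iterated as often as needed (print it, then write it).
lines = number_lines(texts)
print("".join(lines))

# A generator expression can only be consumed once.
gen = ("{}. {}\n".format(ind, text.rstrip("+"))
       for ind, text in enumerate(texts, 1))
first_pass = "".join(gen)
second_pass = "".join(gen)  # the generator is already exhausted here
```

If you want to both print and write the lines, building a list (or rebuilding the generator) avoids an empty output file.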

To separate the price, we can either split or use re.sub to add a space before the dollar sign, then write each line:

import re
with open('output.txt', 'w') as f:
    for line in lines:
        # Capture the trailing "$<digits>" and re-insert it after a space.
        line = re.sub(r"(\$\d+)$", r" \1", line, count=1)
        f.write(line)

Now the output is:

1. Eddie Lacy, RB, Green Bay Packers. Bye: 7 
2. LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11 
3. Marshawn Lynch, RB, Seattle Seahawks. Bye: 9 
4. Adrian Peterson, RB, Minnesota Vikings. Bye: 5 
5. Jamaal Charles, RB, Kansas City Chiefs. Bye: 9 
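To see the substitution in isolation, here is a small sketch on a single string. The "$60" price and the space_before_price helper are invented for illustration; the real page may format prices differently.

```python
import re


def space_before_price(line):
    """Insert a space before a trailing dollar amount, if there is one."""
    # (\$\d+)$ captures a literal '$' followed by digits at the end of the
    # line ('$' the anchor also matches just before a final newline).
    # The replacement " \1" puts the captured price back after a space.
    return re.sub(r"(\$\d+)$", r" \1", line, count=1)


sample = "1. Eddie Lacy, RB, Green Bay Packers. Bye: 7$60\n"  # made-up price
fixed = space_before_price(sample)
unchanged = space_before_price("2. No price here\n")
print(fixed, end="")
```

A line with no trailing price does not match the pattern and is returned as it was.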

You can do the same with str.rsplit, splitting once on the $ and rejoining with a space:

with open('output.txt', 'w') as f:
    for line in lines:
        line,p = line.rsplit("$",1)
        f.write("{} ${}".format(line,p))
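Note that unpacking the result of rsplit raises ValueError when a line contains no $ at all. A hedged sketch of a guarded variant (the split_price helper and the sample price are hypothetical):

```python
def split_price(line):
    """Put a space before the last '$' in line; leave other lines alone."""
    if "$" not in line:
        # Nothing to split on, so rsplit would return a single element
        # and the two-name unpacking below would fail.
        return line
    head, price = line.rsplit("$", 1)  # split once, from the right
    return "{} ${}".format(head, price)


with_price = split_price("1. Eddie Lacy, RB, Green Bay Packers. Bye: 7$60\n")
without_price = split_price("2. No price here\n")
```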

Iterate over the list lines and write each item on its own line:

for num, line in enumerate(lines, 1):
    f.write('{}. {}\n'.format(num, line))

enumerate is used to get (num, line) pairs, with the numbering starting at 1.

By the way, you're better off using a with statement instead of closing the file object manually:

with open('output.txt', 'w') as f:
    for num, line in enumerate(lines, 1):
        f.write('{}. {}\n'.format(num, line))
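For comparison, here is a small sketch of why the original output looked messy: str(lines) writes the repr of the whole list (brackets, quotes and commas) as one string, while the loop writes one formatted entry per line. The sample strings and the temporary file path are assumptions made for the example.

```python
import os
import tempfile

# Made-up stand-ins for the scraped player strings.
lines = [
    "Eddie Lacy, RB, Green Bay Packers. Bye: 7",
    "LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11",
]

# What the question's code wrote: the list's repr as a single string.
messy = str(lines)

# Write one numbered entry per line; 'with' closes the file automatically.
path = os.path.join(tempfile.mkdtemp(), "output.txt")
with open(path, "w") as f:
    for num, line in enumerate(lines, 1):
        f.write("{}. {}\n".format(num, line))

with open(path) as f:
    written = f.read()

print(messy)
print(written, end="")
```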