Python, beautifulsoup scraping specific or exact numbers from a stat table

Question

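How do I set my anchor at "2014" and scrape specific numbers from the 2014 row (that is, the numbers to the right of 2014)?

The code below skips the "Passing" table (which holds all the career passing stats) and tries to scrape from the "Rushing" table (which holds all the career rushing stats). It uses "2014" as the anchor and tries to grab the next five tags after "2014", in an attempt to get the numbers to the right of 2014.

I believe my code is close, but I am getting an error message.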

from bs4 import BeautifulSoup
import nltk 
from urllib import urlopen
import urllib
import re

url = 'http://www.nfl.com/player/tombrady/2504211/careerstats'
html = urlopen(url).read()
soup = BeautifulSoup(html)
table = soup.find("table", { "summary" : "Career Stats In Rushing For Tom Brady" })
for row in table.findAll("tr", { "td" : "2014" }):
    cells = row.findAll("td")
print cells[1]
print cells[2]
print cells[3]
print cells[4]
print cells[5]

Here is the error message:

Traceback (most recent call last):
  File "C:\Users\jcmcdonald\Desktop\test7.py", line 17, in <module>
    print cells[1]
NameError: name 'cells' is not defined

Line 17 is the first "print cells[1]".

Answer 1

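I think this is closer to what you are after. You can't filter on the year the way you tried: the dictionary passed to findAll matches tag attributes, so { "td" : "2014" } looks for a tr that has an attribute td="2014" and finds nothing. The loop body therefore never runs and cells is never assigned, which is what the NameError reports. You need an if statement and have to do the filtering yourself.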

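from bs4 import BeautifulSoup
from urllib import urlopen

url = 'http://www.nfl.com/player/tombrady/2504211/careerstats'
html = urlopen(url).read()
soup = BeautifulSoup(html)
table = soup.find("table", { "summary" : "Career Stats In Rushing For Tom Brady" })
rows = []
for row in table.findAll("tr"):
    # Look at the row's own first cell; header rows have no td at all.
    first_cell = row.find("td")
    if first_cell is not None and first_cell.get_text(strip=True) == '2014':
        # Collect the text of every cell in the 2014 row.
        for item in row.findAll("td"):
            rows.append(item.text)
# rows[0] is the year itself; higher indexes are the cells to its right.
print rows[6]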
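
To try the filtering idea without hitting the network, here is a minimal sketch in Python 3 syntax. The HTML snippet, its placeholder values, and the helper name cells_for_year are made up for illustration and are not the real nfl.com markup; only the summary attribute comes from the question. The helper matches a row by the text of its first td and returns the cells to the right of it. It returns an empty list when no row matches, so nothing is left undefined.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Rushing table; every value is a placeholder.
SAMPLE_HTML = """
<table summary="Career Stats In Rushing For Tom Brady">
  <tr><th>Year</th><th>Team</th><th>G</th><th>Att</th><th>Yds</th><th>Avg</th></tr>
  <tr><td>2014</td><td>XX</td><td>16</td><td>10</td><td>20</td><td>2.0</td></tr>
  <tr><td>2013</td><td>XX</td><td>16</td><td>5</td><td>15</td><td>3.0</td></tr>
</table>
"""

SUMMARY = "Career Stats In Rushing For Tom Brady"


def cells_for_year(html, summary, year):
    """Return the text of every cell to the right of the `year` cell."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"summary": summary})
    if table is None:
        return []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Header rows contain only <th>, so `cells` is empty for them.
        if cells and cells[0].get_text(strip=True) == year:
            return [cell.get_text(strip=True) for cell in cells[1:]]
    return []


stats_2014 = cells_for_year(SAMPLE_HTML, SUMMARY, "2014")
print(stats_2014)
```

Returning a list rather than printing inside the loop makes the "no such year" case explicit: the caller can check for an empty result instead of running into a NameError.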

Answer 2

This should help you get started:

from bs4 import BeautifulSoup
from urllib import urlopen

url = 'http://www.nfl.com/player/tombrady/2504211/careerstats'
html = urlopen(url).read()
soup = BeautifulSoup(html)
table = soup.find("table", { "summary" : "Career Stats In Rushing For Tom Brady" })
# Print the text of each row to see what the table actually contains.
for row in table.find_all('tr'):
    print row.text
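
Once the rows print as expected, one possible next step is to pair each value with its column header, so a number can be looked up by name rather than by position. The sketch below uses Python 3 syntax and an invented HTML snippet; the column names, the placeholder values, and the helper rows_by_year are assumptions for illustration, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical table; column names and values are placeholders.
SAMPLE_HTML = """
<table summary="Career Stats In Rushing For Tom Brady">
  <tr><th>Year</th><th>Team</th><th>Att</th><th>Yds</th></tr>
  <tr><td>2014</td><td>XX</td><td>10</td><td>20</td></tr>
  <tr><td>2013</td><td>XX</td><td>5</td><td>15</td></tr>
</table>
"""


def rows_by_year(table):
    """Map the first-column value of each data row to a header->value dict."""
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    result = {}
    for row in table.find_all("tr"):
        values = [td.get_text(strip=True) for td in row.find_all("td")]
        # Skip header rows (no td) and rows that do not line up with the headers.
        if not values or len(values) != len(headers):
            continue
        record = dict(zip(headers, values))
        result[values[0]] = record
    return result


table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")
stats = rows_by_year(table)
print(stats["2014"]["Yds"])
```

This assumes a single header row whose th cells line up one-to-one with the td cells of each data row; a table with grouped or multi-row headers would need different handling.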
