Python Beautiful Soup Table Data Scraping Specific TD Tags
There are multiple tables on this webpage: http://www.nfl.com/player/tombrady/2504211/gamelogs.
In the HTML, all of the tables are labeled exactly the same:
<table class="data-table1" width="100%" border="0" summary="Game Logs For Tom Brady In 2014">
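I can only scrape data from the first table (the Preseason table). I don't know how to skip it and scrape data from the second and third tables (Regular Season and Postseason).
I'm trying to scrape specific numbers.
My code: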
import pickle
import math
import urllib2
from lxml import etree
from bs4 import BeautifulSoup
from urllib import urlopen

year = '2014'
lastWeek = '2'
favQB1 = "Tom Brady"
favQBurl2 = 'http://www.nfl.com/player/tombrady/2504211/gamelogs'
favQBhtml2 = urlopen(favQBurl2).read()
favQBsoup2 = BeautifulSoup(favQBhtml2)
favQBpass2 = favQBsoup2.find("table", { "summary" : "Game Logs For %s In %s" % (favQB1, year)})
favQBrows2 = []
for row in favQBpass2.findAll("tr"):
    if lastWeek in row.findNext('td'):
        for item in row.findAll("td"):
            favQBrows2.append(item.text)
print ("Enter: Starting Quarterback QB Rating of Favored Team for the last game played (regular season): "),
print favQBrows2[15]
Rely on the table title, which is located in a td element in the first row of each table:
def find_table(soup, label):
    return soup.find("td", text=label).find_parent("table", summary=True)
Usage:
find_table(soup, "Preseason")
find_table(soup, "Regular Season")
find_table(soup, "Postseason")
FYI, see the documentation for find_parent().
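A minimal, self-contained sketch of this approach, written for Python 3 and run against a made-up HTML snippet rather than the live page. The markup, opponent names, and numbers in SAMPLE_HTML are placeholders, and week_row is a hypothetical helper that is not part of the answer above. It uses string=, the newer spelling of the text= argument, and returns None instead of raising when no cell carries the requested label.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the game-log page: every table shares the same
# summary attribute and names its season in a td in its first row.
# Opponents and ratings are placeholders, not real statistics.
SAMPLE_HTML = """
<table summary="Game Logs For Tom Brady In 2014">
  <tr><td colspan="3">Preseason</td></tr>
  <tr><td>2</td><td>OPP-P</td><td>99.9</td></tr>
</table>
<table summary="Game Logs For Tom Brady In 2014">
  <tr><td colspan="3">Regular Season</td></tr>
  <tr><td>1</td><td>OPP-A</td><td>11.1</td></tr>
  <tr><td>2</td><td>OPP-B</td><td>22.2</td></tr>
</table>
<table summary="Game Logs For Tom Brady In 2014">
  <tr><td colspan="3">Postseason</td></tr>
</table>
"""


def find_table(soup, label):
    """Return the table whose title cell has exactly the text `label`, or None."""
    cell = soup.find("td", string=label)
    if cell is None:
        return None
    # Walk up to the enclosing table that carries a summary attribute.
    return cell.find_parent("table", summary=True)


def week_row(table, week):
    """Hypothetical helper: cell texts of the first row whose first cell is `week`."""
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells and cells[0] == week:
            return cells
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
regular = find_table(soup, "Regular Season")
row = week_row(regular, "2")
print(row)  # ['2', 'OPP-B', '22.2']
```

Because the lookup goes through the title cell, the week-2 row comes from the Regular Season table even though the Preseason table also has a week 2.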
The following should also work:
import pickle
import math
import urllib2
from lxml import etree
from bs4 import BeautifulSoup
from urllib import urlopen

year = '2014'
lastWeek = '2'
favQB1 = "Tom Brady"
favQBurl2 = 'http://www.nfl.com/player/tombrady/2504211/gamelogs'
favQBhtml2 = urlopen(favQBurl2).read()
favQBsoup2 = BeautifulSoup(favQBhtml2)
# find_all() returns every matching table; index 1 selects the second one
favQBpass2 = favQBsoup2.find_all("table", { "summary" : "Game Logs For %s In %s" % (favQB1, year)})[1]
favQBrows2 = []
for row in favQBpass2.findAll("tr"):
    if lastWeek in row.findNext('td'):
        for item in row.findAll("td"):
            favQBrows2.append(item.text)
print ("Enter: Starting Quarterback QB Rating of Favored Team for the last game played (regular season): "),
print favQBrows2[15]
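The positional lookup can be sketched in isolation like this, assuming Python 3 and a made-up snippet in place of the live page. nth_table is a hypothetical helper name, and the bounds check is an addition so that a page with fewer tables than expected yields None rather than an IndexError.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page: three tables with an identical summary.
SAMPLE_HTML = """
<table summary="Game Logs For Tom Brady In 2014"><tr><td>Preseason</td></tr></table>
<table summary="Game Logs For Tom Brady In 2014"><tr><td>Regular Season</td></tr></table>
<table summary="Game Logs For Tom Brady In 2014"><tr><td>Postseason</td></tr></table>
"""


def nth_table(soup, summary, index):
    """Hypothetical helper: the index-th table with this summary, or None."""
    tables = soup.find_all("table", {"summary": summary})
    if 0 <= index < len(tables):
        return tables[index]
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
summary = "Game Logs For %s In %s" % ("Tom Brady", "2014")

second = nth_table(soup, summary, 1)
print(second.find("td").get_text(strip=True))  # Regular Season
```

Indexing depends on the order of the tables in the document, so if a page omitted one of them, index 1 would point at a different season.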