BeautifulSoup returning unrelated HTML
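I am trying to parse basketball stats from pages like http://www.sports-reference.com/cbb/boxscores/2014-11-14-kentucky.html. I am using Python 2.7.6 and BeautifulSoup 4-4.3.2. I search game-log pages like the one above for the class "sortable" so that I can access the raw stats contained in the tables. I am only interested in the "Basic Stats" for each team.
However, the HTML that BeautifulSoup returns is not at all what I expect. Instead, I get a list of all-time team records and data for every school that has ever played. I don't have enough reputation to post a second link showing the output, or I would.
Basically, there are four class "sortable" tables on the boxscore page. When I ask BeautifulSoup to find them in the only way I can think of to distinguish them from the other data, it returns completely unrelated data instead, and I can't even figure out where the returned data comes from.
Here is the code: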
import urllib2
import re
import sys
from bs4 import BeautifulSoup

class Gamelogs():
    def __init__(self):
        #the base page that has all boxscore links
        self.teamPageSoup = BeautifulSoup(urllib2.urlopen(
            'http://www.sports-reference.com/cbb/schools/' + school +
            '/2015-gamelogs.html'))
        #use regex to only find links with score data
        self.statusPageLinks = self.teamPageSoup.findAll(href=re.compile(
            "boxscores"));

def scoredata(links, school):
    #for each link in the school's season
    for l in links:
        gameSoup = BeautifulSoup(urllib2.urlopen(l))
        #remove extra link formatting to get just filename alone
        l = l[59+len(school):]
        #open a local file with that filename to store the results
        fo = open(str(l),"w")
        #create a list that will hold the box score data only
        output = gameSoup.findAll(class_="sortable")
        #write it line by line to the file that was just opened
        for o in output:
            fo.write(str(o) + '\n')
        fo.close

def getlinks(school):
    gamelogs = Gamelogs()
    #open a new file to store the output
    fo = open(school + '.txt',"w")
    #remove extraneous links
    gamelogs.statusPageLinks = gamelogs.statusPageLinks[2:]
    #create the list that will hold each school's seasonlong boxscores
    boxlinks = list()
    for s in gamelogs.statusPageLinks:
        #make the list element a string so it can be sliced
        string = str(s)
        #remove extra link formatting
        string = string[9:]
        string = string[:-16]
        #create the full list of games per school
        boxlinks.insert(0, 'http://www.sports-reference.com/cbb/schools/'
                        + school + string)
    scoredata(boxlinks, school)

if __name__ == '__main__':
    #for each school as a commandline argument
    for arg in sys.argv[1:]:
        school = arg
        getlinks(school)
Is this a problem with BeautifulSoup, my code, or the site?
It looks like the problem is in your code. The page you are getting back sounds like this one: http://www.sports-reference.com/cbb/schools/?redir
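One way to confirm that a response is really the page that was requested is to compare the requested URL with the URL the server finally served. The following is only a sketch: `fetch_checked` and `UnexpectedRedirect` are hypothetical names introduced here, and the strict equality test is an assumption that may be too strict for a site that normalizes URLs (for example, by adding a trailing slash).

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2, as used in the question


class UnexpectedRedirect(Exception):
    """Raised when the URL that was served differs from the URL requested."""


def fetch_checked(url, opener=urlopen):
    """Fetch url and return its body, refusing silently redirected responses.

    fetch_checked is a hypothetical helper, not part of urllib2 or bs4.
    """
    response = opener(url)
    # geturl() reports the URL of the resource actually retrieved,
    # i.e. the final URL after any redirects were followed.
    final_url = response.geturl()
    if final_url != url:
        raise UnexpectedRedirect('%s was redirected to %s' % (url, final_url))
    return response.read()
```

In `scoredata`, the body returned by `fetch_checked(l)` could be passed to `BeautifulSoup` in place of `urllib2.urlopen(l)`, so that a redirected request fails loudly instead of being parsed as if it were a boxscore.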
Whenever I enter an invalid school name, I get redirected to a page that shows stats for 477 different teams. FYI: the team names in the URL are also case sensitive.
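An invalid URL can also come from how `getlinks` assembles the boxscore links. The sketch below assumes each anchor's `href` is a site-absolute path such as `/cbb/boxscores/2014-11-14-kentucky.html`, which is what the slice offsets in the question suggest; that `href` value is an assumption, not something taken from the live site. Under that assumption, prepending `'http://www.sports-reference.com/cbb/schools/' + school` yields a path under `/cbb/schools/` rather than the boxscore page, while `urljoin` resolves the `href` against the page it was found on.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

school = 'kentucky'
page_url = ('http://www.sports-reference.com/cbb/schools/' + school +
            '/2015-gamelogs.html')

# Assumed href value. With BeautifulSoup it can be read directly as
# tag['href'] instead of slicing str(tag).
href = '/cbb/boxscores/2014-11-14-kentucky.html'

# What the string concatenation in getlinks() produces for this href.
concatenated = 'http://www.sports-reference.com/cbb/schools/' + school + href

# Resolving the href against the page URL, the way a browser would.
resolved = urljoin(page_url, href)

print(concatenated)
# http://www.sports-reference.com/cbb/schools/kentucky/cbb/boxscores/2014-11-14-kentucky.html
print(resolved)
# http://www.sports-reference.com/cbb/boxscores/2014-11-14-kentucky.html
```

Only the second URL matches the boxscore address given in the question. Whether the site answers the first one with the redirect page described above was not verified here.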