BeautifulSoup returning unrelated HTML

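I'm trying to parse basketball stat data from pages such as http://www.sports-reference.com/cbb/boxscores/2014-11-14-kentucky.html. I'm using Python 2.7.6 and BeautifulSoup 4-4.3.2. I'm searching the gamelogs for class "sortable" in order to get at the raw stat data contained in the tables. I'm only interested in the "Basic Stats" for each team.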

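However, the HTML that BeautifulSoup returns is not at all what I expect. Instead, I get a list of all-time team records and data for every school that has ever played. I don't have enough reputation to post a second link showing the output, or I would.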

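Basically, there are four class "sortable" tables on the boxscore page. When I ask BS to find them in the only way I can think of to distinguish them from the other data, it instead returns completely irrelevant data, and I can't even figure out where the returned data comes from.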

Here's the code:

import urllib2
import re
import sys
from bs4 import BeautifulSoup

class Gamelogs():

    def __init__(self):

        #the base page that has all boxscore links
        self.teamPageSoup = BeautifulSoup(urllib2.urlopen(
        'http://www.sports-reference.com/cbb/schools/' + school +
        '/2015-gamelogs.html'))
        #use regex to only find links with score data       
        self.statusPageLinks = self.teamPageSoup.findAll(href=re.compile(
        "boxscores"))

def scoredata(links, school):
    #for each link in the school's season   
    for l in links:
        gameSoup = BeautifulSoup(urllib2.urlopen(l))
        #remove extra link formatting to get just filename alone
        l = l[59+len(school):]
        #open a local file with that filename to store the results
        fo = open(str(l),"w")
        #create a list that will hold the box score data only   
        output = gameSoup.findAll(class_="sortable")
        #write it line by line to the file that was just opened 
        for o in output:
            fo.write(str(o) + '\n')
        fo.close()

def getlinks(school):
    gamelogs = Gamelogs()
    #open a new file to store the output
    fo = open(school + '.txt',"w")
    #remove extraneous links
    gamelogs.statusPageLinks = gamelogs.statusPageLinks[2:]
    #create the list that will hold each school's seasonlong boxscores
    boxlinks = list()
    for s in gamelogs.statusPageLinks:
        #make the list element a string so it can be sliced
        string = str(s)
        #remove extra link formatting
        string = string[9:]
        string = string[:-16]
        #create the full list of games per school
        boxlinks.insert(0, 'http://www.sports-reference.com/cbb/schools/'
        + school + string)
    scoredata(boxlinks, school)     

if __name__ == '__main__':
    #for each school as a commandline argument  
    for arg in sys.argv[1:]:
        school = arg    
        getlinks(school)

Is this an issue with BS, my code, or the website?

It seems there's an issue with your code. The page you're getting back sounds like this one: http://www.sports-reference.com/cbb/schools/?redir

Whenever I enter an invalid school name, I get redirected to a page showing stats for 477 different teams. FYI: team names in the URL are also case sensitive.
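One way to catch this kind of silent redirect is to compare the URL you requested with the URL the response actually came from before handing the page to BeautifulSoup. The following is only a rough sketch: `looks_redirected` and `fetch_checked` are hypothetical helper names, not part of urllib2 or BeautifulSoup, and the comparison is deliberately strict, so it would also flag harmless redirects such as http to https.

```python
try:
    from urllib2 import urlopen  # Python 2, as in the question
except ImportError:
    from urllib.request import urlopen  # Python 3 equivalent


def looks_redirected(requested_url, final_url):
    """Return True if the response came from a different URL than requested.

    Hypothetical helper: trailing slashes are ignored, everything else
    must match exactly.
    """
    return requested_url.rstrip('/') != final_url.rstrip('/')


def fetch_checked(url, opener=urlopen):
    """Fetch url and raise ValueError if the server redirected elsewhere.

    `opener` is injectable only so the function can be exercised
    without network access; by default it is the standard urlopen.
    """
    response = opener(url)
    # geturl() reports the URL actually retrieved, after any redirects
    final_url = response.geturl()
    if looks_redirected(url, final_url):
        raise ValueError('%s redirected to %s' % (url, final_url))
    return response.read()
```

In the question's `scoredata` loop, something like `BeautifulSoup(fetch_checked(l))` in place of `BeautifulSoup(urllib2.urlopen(l))` would then fail loudly on a bad link instead of quietly parsing the schools index page.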