BeautifulSoup returning unrelated HTML

Question

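I am trying to parse basketball statistics from pages such as http://www.sports-reference.com/cbb/boxscores/2014-11-14-kentucky.html. I am using Python 2.7.6 and BeautifulSoup 4 (4.3.2). I search each game log for elements with class "sortable" in order to get at the raw statistics contained in the tables. I am only interested in the "Basic Stats" for each team.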

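However, the HTML that BeautifulSoup returns is not at all what I expect. Instead, I get a list of all-time team records and data for every school that has ever played. I don't have enough reputation to post a second link showing the output, or I would.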

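Basically, there are four tables with class "sortable" on the boxscore page. When I ask BS to find them in the only way I can think of to distinguish them from the other data, it instead returns completely unrelated data, and I can't even figure out where the returned data comes from.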

Here is the code:

import urllib2
import re
import sys
from bs4 import BeautifulSoup

class Gamelogs():

    def __init__(self):

        #the base page that has all boxscore links
        self.teamPageSoup = BeautifulSoup(urllib2.urlopen(
        'http://www.sports-reference.com/cbb/schools/' + school +
        '/2015-gamelogs.html'))
        #use regex to only find links with score data
        self.statusPageLinks = self.teamPageSoup.findAll(href=re.compile(
        "boxscores"))

def scoredata(links, school):
    #for each link in the school's season   
    for l in links:
        gameSoup = BeautifulSoup(urllib2.urlopen(l))
        #remove extra link formatting to get just filename alone
        l = l[59+len(school):]
        #open a local file with that filename to store the results
        fo = open(str(l),"w")
        #create a list that will hold the box score data only   
        output = gameSoup.findAll(class_="sortable")
        #write it line by line to the file that was just opened 
        for o in output:
            fo.write(str(o) + '\n')
        fo.close()

def getlinks(school):
    gamelogs = Gamelogs()
    #open a new file to store the output
    fo = open(school + '.txt',"w")
    #remove extraneous links
    gamelogs.statusPageLinks = gamelogs.statusPageLinks[2:]
    #create the list that will hold each school's seasonlong boxscores
    boxlinks = list()
    for s in gamelogs.statusPageLinks:
        #make the list element a string so it can be sliced
        string = str(s)
        #remove extra link formatting
        string = string[9:]
        string = string[:-16]
        #create the full list of games per school
        boxlinks.insert(0, 'http://www.sports-reference.com/cbb/schools/'
        + school + string)
    scoredata(boxlinks, school)     

if __name__ == '__main__':
    #for each school as a commandline argument  
    for arg in sys.argv[1:]:
        school = arg    
        getlinks(school)

Is this a problem with BS (BeautifulSoup), my code, or the website?

Answer 1

It looks like the problem is in your code. The page you are getting back sounds like this one: http://www.sports-reference.com/cbb/schools/?redir
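
To confirm that this is what is happening, one option is to compare the URL you asked for with the URL you actually ended up on. Below is a minimal sketch, not tested against the live site: `fetch_checked` and its `opener` parameter are names made up for illustration, and the strict comparison is an assumption that would also flag harmless redirects (for example http to https).

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2, as used in the question


def fetch_checked(url, opener=urlopen):
    """Fetch url and raise ValueError if the final URL differs from it.

    `opener` is a hypothetical hook so the check can be exercised
    without network access; by default it is the standard urlopen.
    """
    response = opener(url)
    # geturl() reports the URL of the resource actually retrieved,
    # i.e. the address after any redirects were followed.
    final_url = response.geturl()
    if final_url != url:
        raise ValueError("requested %s but was redirected to %s"
                         % (url, final_url))
    return response.read()
```

In `scoredata` you could then call `BeautifulSoup(fetch_checked(l))` instead of `BeautifulSoup(urllib2.urlopen(l))`, so a bad link fails loudly rather than silently parsing the redirect target.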

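Whenever I enter an invalid school name, I am redirected to a page showing statistics for 477 different teams. FYI: the team names in the URL are also case-sensitive.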
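
A possible source of the invalid URLs, judging only from the code in the question: the slice offset `59 + len(school)` suggests each link's href starts with `/cbb/boxscores/`, so prepending `http://www.sports-reference.com/cbb/schools/` plus the school name produces a path under `/cbb/schools/` rather than the boxscore URL. The sketch below contrasts the two; the href value and the `kentucky` slug are assumptions taken from the example URL in the question, and `naive_url` / `joined_url` are hypothetical helper names.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in the question

BASE = 'http://www.sports-reference.com/cbb/schools/'


def naive_url(school, href):
    # Mirrors the string concatenation done in getlinks() in the question.
    return BASE + school + href


def joined_url(school, href):
    # Resolve the href relative to the page it was found on. An href that
    # starts with "/" replaces the whole path of the base URL.
    page = BASE + school + '/2015-gamelogs.html'
    return urljoin(page, href)


# Assumed href, reconstructed from the boxscore URL in the question.
href = '/cbb/boxscores/2014-11-14-kentucky.html'

print(naive_url('kentucky', href))
# http://www.sports-reference.com/cbb/schools/kentucky/cbb/boxscores/2014-11-14-kentucky.html
print(joined_url('kentucky', href))
# http://www.sports-reference.com/cbb/boxscores/2014-11-14-kentucky.html
```

If that is what is going on, reading the attribute directly with `s['href']` and passing it to `urljoin` would also avoid the fragile `string[9:]` and `string[:-16]` slicing.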

python

parsing

html-table

beautifulsoup

web-scraping