Python Beautiful Soup Scraping Exact Content From Charts

Question

In Python, using Beautiful Soup, I want to scrape specific values from a sortable table online: the text of an <a> link and the numbers in certain <td> cells.

http://www.nfl.com/stats/categorystats?archive=false&conference=null&role=OPP&offensiveStatisticCategory=null&defensiveStatisticCategory=INTERCEPTIONS&season=2014&seasonType=REG&tabSeq=2&qualified=false&Submit=Go

I've tried about a million times and can't figure it out.

This is the best I've managed:

from bs4 import BeautifulSoup
import urllib2
import requests
import pymongo
import re

soup = BeautifulSoup(urllib2.urlopen('http://www.nfl.com/stats/categorystats?archive=false&conference=null&role=OPP&offensiveStatisticCategory=null&defensiveStatisticCategory=INTERCEPTIONS&season=2014&seasonType=REG&tabSeq=2&qualified=false&Submit=Go').read())

find = soup('a', text="Miami Dolphins")

print find

I can't figure out how to find the 10th <td> tag (index 9 in Python) after the Miami Dolphins cell.

The table markup looks like this:

<table id="result" style="width:100%" cellpadding="0" class="data-table1"
cellspacing="0">
   <caption class="thd1">...</caption>
   <tbody>...</tbody>
   <tbody>
      <tr class="odd">...</tr>
      <tr class="even">...</tr>
      <tr class="odd">...</tr>
      <tr class="even">...</tr>
      <tr class="odd">...</tr>
      <tr class="even">...</tr>
      <tr class="odd">...</tr>
      <tr class="even">...</tr>
      <tr class="odd">...</tr>
      <tr class="even">...</tr>
      <tr class="odd">...</tr>
      <tr class="even">...</tr>
      <tr class="odd">...</tr>
      <tr class="even">...</tr>
      <tr class="odd">...</tr>
      <tr class="even">
         <td>14</td>
         <td>
            <a href="/teams/miamidolphins/profile?team=MIA" onclick=
            "s_objectID=&quot;http://www.nfl.com/teams.miamidolphins/profile?
            team=MIA_1&quot;;return this.s_oc?this.s_oc(e):true">Miami Dolphins</a>    <!-- I want to grab the team name -->
         </td>
         <td>

         16

         </td>
         <td>24.2</td>
         <td>

         388

         </td>
         <td class="right">...</td>
         <td class="right">...</td>
         <td class="right">...</td>
         <td class="right">...</td>
         <td class="right">...</td>
         <td class="right">...</td>
         <td class="sorted right">

            14    <!-- I want to grab this number: the 10th <td> after the team name -->

         </td>
         <td class="right">...</td>
         <td class="right">...</td>
         <td class="right">...</td>
         <td class="right">...</td>
      </tr>
   </tbody>
</table>

Answer 1

Try this:

import urllib2
from lxml import etree

url = 'http://www.nfl.com/stats/categorystats?archive=false&conference=null&role=OPP&offensiveStatisticCategory=null&defensiveStatisticCategory=INTERCEPTIONS&season=2014&seasonType=REG&tabSeq=2&qualified=false&Submit=Go'
# Fetch the page and parse it with lxml's lenient HTML parser.
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()

tree = etree.parse(response, htmlparser)

# Find the <a> containing the team name, step up to its enclosing <td>,
# then take the 10th following sibling <td> (XPath positions are 1-based).
text = tree.xpath('//a[contains(text(),"Miami Dolphins")]/parent::td/following-sibling::td[10]/text()')
if text:
    print text[0].strip()
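
Since the question asked about Beautiful Soup specifically, here is a rough sketch of the same idea using bs4 instead of XPath. It is written in Python 3 syntax and runs against a small inline HTML sample rather than the live page, so it is only an illustration. The sample row, its placeholder cell values (a to g stand in for the cells shown as "..." in the question), and the helper name nth_cell_after_team are my own assumptions, not something from the NFL page or the bs4 API.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for one row of the stats table.
# Cells marked a..g are placeholders for values elided in the question.
SAMPLE_HTML = """
<table id="result">
  <tbody>
    <tr class="even">
      <td>14</td>
      <td><a href="/teams/miamidolphins/profile?team=MIA">Miami Dolphins</a></td>
      <td>16</td><td>24.2</td><td>388</td>
      <td class="right">a</td><td class="right">b</td><td class="right">c</td>
      <td class="right">d</td><td class="right">e</td><td class="right">f</td>
      <td class="sorted right">
          14
      </td>
      <td class="right">g</td>
    </tr>
  </tbody>
</table>
"""


def nth_cell_after_team(html, team, n):
    """Return the text of the n-th <td> (1-based) after the team's cell, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the link whose text is exactly the team name.
    link = soup.find("a", string=team)
    if link is None:
        return None
    # Step up to the enclosing cell, then collect the cells that follow it.
    team_cell = link.find_parent("td")
    following = team_cell.find_next_siblings("td")
    if len(following) < n:
        return None
    # Python lists are 0-based, so the 10th cell is index 9.
    return following[n - 1].get_text(strip=True)


value = nth_cell_after_team(SAMPLE_HTML, "Miami Dolphins", 10)
print(value)  # prints 14
```

On the real page you would pass the downloaded HTML in place of SAMPLE_HTML. If the link text there carries extra whitespace, the exact string match may fail, and matching on a stripped or partial string would be needed instead.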

python

html-table

beautifulsoup

web-scraping