Python Beautiful Soup Scraping Exact Content From Charts
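Using Beautiful Soup in Python, I want to scrape specific text (the team name inside an <a> tag and the numbers inside <td> cells) from a sortable table on a web page.
I have tried about a million times but can't figure it out.
This is the best I have managed so far: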
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = 'http://www.nfl.com/stats/categorystats?archive=false&conference=null&role=OPP&offensiveStatisticCategory=null&defensiveStatisticCategory=INTERCEPTIONS&season=2014&seasonType=REG&tabSeq=2&qualified=false&Submit=Go'
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
# Calling soup(...) is shorthand for soup.find_all(...); this returns the matching <a> tags.
find = soup('a', string="Miami Dolphins")
print(find)
I don't know how to find the 10th <td> tag (index 9 in Python) after "Miami Dolphins".
The table markup looks like this:
<table id="result" style="width:100%" cellpadding="0" class="data-table1"
cellspacing="0">
<caption class="thd1">...</caption>
<tbody>...</tbody>
<tbody>
<tr class="odd">...</tr>
<tr class="even">...</tr>
<tr class="odd">...</tr>
<tr class="even">...</tr>
<tr class="odd">...</tr>
<tr class="even">...</tr>
<tr class="odd">...</tr>
<tr class="even">...</tr>
<tr class="odd">...</tr>
<tr class="even">...</tr>
<tr class="odd">...</tr>
<tr class="even">...</tr>
<tr class="odd">...</tr>
<tr class="even">...</tr>
<tr class="odd">...</tr>
<tr class="even">
<td>14</td>
<td>
<a href="/teams/miamidolphins/profile?team=MIA" onclick="s_objectID=&quot;http://www.nfl.com/teams.miamidolphins/profile?team=MIA_1&quot;;return this.s_oc?this.s_oc(e):true">Miami Dolphins</a> *********I want to grab team name**********
</td>
<td>
16
</td>
<td>24.2</td>
<td>
388
</td>
<td class="right">...</td>
<td class="right">...</td>
<td class="right">...</td>
<td class="right">...</td>
<td class="right">...</td>
<td class="right">...</td>
<td class="sorted right">
14 ****I want to grab 10th number/<td> tag after team name****
</td>
<td class="right">...</td>
<td class="right">...</td>
<td class="right">...</td>
<td class="right">...</td>
</tr>
Try this:
from urllib.request import urlopen

from lxml import etree

url = 'http://www.nfl.com/stats/categorystats?archive=false&conference=null&role=OPP&offensiveStatisticCategory=null&defensiveStatisticCategory=INTERCEPTIONS&season=2014&seasonType=REG&tabSeq=2&qualified=false&Submit=Go'
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
# From the <a> containing the team name, step up to its <td>, then take the
# 10th following <td> sibling (XPath positions are 1-based).
text = tree.xpath('//a[contains(text(),"Miami Dolphins")]/parent::td/following-sibling::td[10]/text()')
if text:
    print(text[0].strip())
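Since the question asks about Beautiful Soup specifically, the same lookup can probably also be done without XPath. The following is only a sketch written for Python 3: `SAMPLE_HTML` is a hypothetical, trimmed-down stand-in for the row shown in the question (the placeholder cell values `a` to `g` are made up), and `stat_after_team` is an illustrative helper name, not part of bs4. It assumes the live page has the same cell layout as the excerpt in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the stats row in the question; cells a-g are placeholders.
SAMPLE_HTML = """
<table id="result">
  <tbody>
    <tr class="even">
      <td>14</td>
      <td><a href="/teams/miamidolphins/profile?team=MIA">Miami Dolphins</a></td>
      <td>16</td><td>24.2</td><td>388</td>
      <td class="right">a</td><td class="right">b</td><td class="right">c</td>
      <td class="right">d</td><td class="right">e</td><td class="right">f</td>
      <td class="sorted right">
        14
      </td>
      <td class="right">g</td>
    </tr>
  </tbody>
</table>
"""


def stat_after_team(html, team, offset=10):
    """Return (team name, text of the offset-th <td> after the team's cell), or None."""
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a"):
        # Compare stripped text so whitespace around the name does not matter.
        if link.get_text(strip=True) != team:
            continue
        team_cell = link.find_parent("td")
        if team_cell is None:
            continue
        siblings = team_cell.find_next_siblings("td")
        if len(siblings) < offset:
            return None
        # offset is 1-based, like td[10] in the XPath; Python lists are 0-based.
        return link.get_text(strip=True), siblings[offset - 1].get_text(strip=True)
    return None


print(stat_after_team(SAMPLE_HTML, "Miami Dolphins"))
```

To run it against the real page, pass the downloaded HTML instead of `SAMPLE_HTML`. Counting siblings from the team's own cell keeps the lookup independent of the row's position in the table, so it should still work after the table is re-sorted.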