"List index out of range" error when using soup.select('placeholder')[0].get_text() in Beautiful Soup
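I'm new to scraping. I'm trying to use Beautiful Soup to get the wheelbase value (and eventually other things) from a Wikipedia page (I'll deal with robots.txt later). This is the guide I've been using.
Two questions:
1.) How do I resolve the error below?
2.) How do I scrape the value in the cell that contains the wheelbase? Is the selector "td#Wheelbase td"?
The error I get is: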
File "evscraper.py", line 25, in <module>
wheelbase_data['Wheelbase'] = soup.select('div#Wheelbase h3') [0].get_text()
IndexError: list index out of range
Thanks for your help!
__author__ = 'KirkLazarus'
import re
import json
import gspread
from oauth2client.client import SignedJwtAssertionCredentials
import bs4
from bs4 import BeautifulSoup
import requests
response =requests.get ('https://en.wikipedia.org/wiki/Tesla_Model_S')
soup = bs4.BeautifulSoup(response.text)
wheelbase_data['Wheelbase'] = soup.select('div#Wheelbase h3')[0].get_text()
print wheelbase_data
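Well, your first problem is your selector. There is no div with the ID "Wheelbase" on that page, so select() returns an empty list, and indexing that list with [0] raises the IndexError.
What follows is by no means perfect, but it will get you what you want, and only because you already know the structure of the page: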
import bs4
import requests

wheelbase_data = {}
response = requests.get('https://en.wikipedia.org/wiki/Tesla_Model_S')
soup = bs4.BeautifulSoup(response.text, 'html.parser')

# Find the first link that points at the Wheelbase article.
wheelbase = None
for link in soup.find_all('a'):
    if link.get('href') == "/wiki/Wheelbase":
        wheelbase = link
        break

# Assumes the link's grandparent is the table row; its first <td> holds the value.
if wheelbase is not None:
    wheelbase_data['Wheelbase'] = wheelbase.parent.parent.td.text
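To illustrate both points (an unmatched selector gives back an empty list, and the value can be reached from the header link), here is a minimal sketch that runs against a small inline HTML snippet instead of the live page. The snippet, its placeholder values, and the helper names first_text and infobox_value are assumptions made up for the example; the real Wikipedia markup may differ. Note also that "#Wheelbase" in a CSS selector only matches an element whose id attribute is "Wheelbase", so "td#Wheelbase td" would come back empty for markup like this too.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for two infobox rows (placeholder values).
SAMPLE_HTML = """
<table class="infobox">
  <tr><th><a href="/wiki/Wheelbase">Wheelbase</a></th><td>1,000 mm</td></tr>
  <tr><th><a href="/wiki/Length">Length</a></th><td>2,000 mm</td></tr>
</table>
"""


def first_text(soup, selector):
    """Return the text of the first match, or None when nothing matches."""
    matches = soup.select(selector)
    if not matches:  # select() returned an empty list
        return None
    return matches[0].get_text(strip=True)


def infobox_value(soup, href):
    """Find the header link by href, then read the <td> in the same row."""
    link = soup.find("a", href=href)
    if link is None:
        return None
    row = link.find_parent("tr")
    cell = row.find("td") if row is not None else None
    return cell.get_text(strip=True) if cell is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
missing = first_text(soup, "div#Wheelbase h3")   # no such div in the snippet
wheelbase = infobox_value(soup, "/wiki/Wheelbase")
print(missing)
print(wheelbase)
```

Returning None instead of indexing blindly means a changed page layout shows up as a missing value you can check for, not as an IndexError in the middle of the script.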
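The path you are looking up seems to be incorrect. I've had to do something similar in the past. I'm not sure this is the best approach, but it certainly worked for me.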
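# Python 2 (urllib2)
import pandas as pd
from bs4 import BeautifulSoup
import urllib2

car_data = pd.DataFrame()
models = ['Tesla_Model_S', 'Tesla_Model_X']
for model in models:
    wiki = "https://en.wikipedia.org/wiki/{0}".format(model)
    header = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(wiki, headers=header)
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)
    # The infobox is the table with the classes "infobox hproduct".
    table = soup.find("table", {"class": "infobox hproduct"})
    # Skip the first two rows, then read each row as a field/value pair.
    for row in table.findAll("tr")[2:]:
        try:
            field = row.findAll("th")[0].text.strip()
            val = row.findAll("td")[0].text.strip()
            # set_value was deprecated in pandas 0.21 and removed in 1.0;
            # the deprecation notice points to .at[] as the replacement.
            car_data.set_value(model, field, val)
        except IndexError:
            # Rows without both a <th> and a <td> are skipped.
            pass
car_data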
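For anyone on Python 3, the same row-by-row idea can be sketched without urllib2 or pandas. This is only an illustration under stated assumptions: the inline HTML, the model names "Model_A" and "Model_B", the placeholder values, and the helper parse_infobox are all invented for the example, and it matches any table that carries the "infobox" class instead of the exact class string used above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for downloaded pages, keyed by a made-up model name.
SAMPLE_PAGES = {
    "Model_A": """<table class="infobox hproduct">
        <tr><th colspan="2">Model A</th></tr>
        <tr><th>Wheelbase</th><td>1,000 mm</td></tr>
        <tr><th>Length</th><td>2,000 mm</td></tr>
    </table>""",
    "Model_B": "<p>This page has no infobox.</p>",
}


def parse_infobox(html):
    """Collect {header text: cell text} for rows that have both a <th> and a <td>."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="infobox")
    if table is None:  # no infobox on this page
        return {}
    fields = {}
    for row in table.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header is None or value is None:
            continue  # title or section rows carry no field/value pair
        fields[header.get_text(strip=True)] = value.get_text(strip=True)
    return fields


car_data = {model: parse_infobox(html) for model, html in SAMPLE_PAGES.items()}
print(car_data)
```

Checking for a missing th or td explicitly replaces the try/except, so only the rows that really lack a pair are skipped and other errors still surface. If a table is wanted at the end, the resulting dict of dicts can be passed to pandas.DataFrame.from_dict(car_data, orient="index").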