TypeError: string indices must be integers when trying to print a href
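I'm trying to scrape details from the 25 links on this site:
https://beta.companieshouse.gov.uk/search/companies?q=SW181Db&page=1
'/company/08569390' is an href value in the underlying HTML, so essentially I'm trying to concatenate base_url ('https://beta.companieshouse.gov.uk/') with the text in the href so that I can loop through the 25 pages.
My code (below) gives me the message TypeError: string indices must be integers.
Can someone explain where I'm going wrong? Do I need to convert the contents of the href to an integer, even though it also contains some text (/company/)?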
import requests
from bs4 import BeautifulSoup
import csv
base_url = 'https://beta.companieshouse.gov.uk/'
header={'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding':'gzip, deflate, sdch, br',
'Accept-Language':'en-US,en;q=0.8,fr;q=0.6',
'Connection':'keep-alive',
'Cookie':'mdtp=y4Ts2Vvql5V9MMZNjqB9T+7S/vkQKPqjHHMIq5jk0J1l5l131dU0YXsq7Rr15GDyghKHrS/qcD2vdsMCVtzKByJEDZFI+roS6tN9FN5IS70q8PkCCBjgFPDZjlR1A3H9FJ/zCWXMNJbaXqF8MgqE+nhR3/lji+eK4mm/GP9b8oxlVdupo9KN9SKanxu/JFEyNXutjyN+BsxRztNem1Z+ExSQCojyxflI/tc70+bXAu3/ppdP7fIXixfEOAWezmOh3ywchn9DV7Af8wH45t8u4+Y=; mdtpdi=mdtpdi#f523cd04-e09e-48bc-9977-73f974d50cea#1484041095424_zXDAuNhEkKdpRUsfXt+/1g==; seen_cookie_message=yes; _ga=GA1.4.666959744.1484041122; _gat=1',
'Host':'https://beta.companieshouse.gov.uk/',
#'Referer':'https://beta.companieshouse.gov.uk/',
'Upgrade-Insecure-Requests':'1',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.51 Safari/537.36'
}
session = requests.session()
url = 'https://beta.companieshouse.gov.uk/search/companies?q=SW181Db&page=1'
response = session.get(url, headers=header)
soup = BeautifulSoup(response.content,"lxml")
rslt_table = soup.find("article")
for elem in rslt_table:
    det_url = base_url+elem['href']
    print(det_url)
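I tried your code and eventually got it working. The changes I made are:
links = []
headers = soup.findAll("h3")
for header in headers:
    # each result heading wraps the <a> tag that links to the company page
    det_url = base_url + header.find('a')['href']
    links.append(det_url)
    print(det_url)
print(links)
The output I got is: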
['https://beta.companieshouse.gov.uk//company/08569390', 'https://beta.companieshouse.gov.uk//company/09947251', 'https://beta.companieshouse.gov.uk//company/07352770', 'https://beta.companieshouse.gov.uk//company/07908180', 'https://beta.companieshouse.gov.uk//company/04576887', 'https://beta.companieshouse.gov.uk//company/08760943', 'https://beta.companieshouse.gov.uk//company/08265394', 'https://beta.companieshouse.gov.uk//company/03893510', 'https://beta.companieshouse.gov.uk//company/07422059', 'https://beta.companieshouse.gov.uk//company/08819027', 'https://beta.companieshouse.gov.uk//company/08325123', 'https://beta.companieshouse.gov.uk//company/09669365', 'https://beta.companieshouse.gov.uk//company/08641990', 'https://beta.companieshouse.gov.uk//company/06318392', 'https://beta.companieshouse.gov.uk//company/09400775', 'https://beta.companieshouse.gov.uk//company/01930797', 'https://beta.companieshouse.gov.uk//company/09398542', 'https://beta.companieshouse.gov.uk//company/07784981', 'https://beta.companieshouse.gov.uk//company/07480763', 'https://beta.companieshouse.gov.uk//company/06971238']
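The URLs in that output contain a double slash (`//company/...`) because base_url ends with `/` and each href starts with `/`. As a minimal sketch of one way to avoid this, assuming Python 3 (in Python 2 the same function lives in the `urlparse` module), the standard library's `urljoin` can build the URL instead of string concatenation:

```python
from urllib.parse import urljoin

base_url = "https://beta.companieshouse.gov.uk/"
href = "/company/08569390"  # value copied from the question

# Plain concatenation keeps both slashes.
naive = base_url + href

# urljoin resolves the href against the base, as a browser would.
joined = urljoin(base_url, href)

print(naive)
print(joined)
```

Dropping the trailing slash from base_url would also work, as long as every href starts with `/`.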
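This line:
rslt_table = soup.find("article")
returns a single article element. When you then do:
for elem in rslt_table:
you are iterating over the direct children of that article, and some of those children are plain text nodes. In that case elem is a string, and a string cannot be indexed by another string, which is what elem["href"] attempts. What you want is to get the a elements inside rslt_table rather than the strings:
for elem in rslt_table.find_all("a"):
Changing that one line will give you what you want.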
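To make the difference concrete, here is a small self-contained sketch. The HTML string is a made-up stand-in for the real results page, and it uses the built-in "html.parser" rather than lxml so that nothing beyond bs4 is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the search results page.
html = """
<article>
  <h3><a href="/company/08569390" title="View company">Example Ltd</a></h3>
  <h3><a href="/company/09947251" title="View company">Sample Ltd</a></h3>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find("article")

# Iterating a Tag walks its direct children, including whitespace text nodes.
child_types = [type(child).__name__ for child in article]

# The first child is the whitespace after <article>, which is a string,
# so indexing it with 'href' raises TypeError.
first_child = next(iter(article))
try:
    first_child["href"]
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# find_all("a") returns only <a> tags, which support attribute lookup.
hrefs = [a["href"] for a in article.find_all("a")]

print(child_types)
print(error_name)
print(hrefs)
```

The text nodes come from the line breaks and indentation between tags, which is why the error can appear even when the page looks like it contains nothing but elements.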
soup.find("article") is not the way to locate all of those company tags. Try find_all instead:
base_url = 'https://beta.companieshouse.gov.uk'
companies = soup.find_all('a', {'title': 'View company'}) # to get all company <a> tags
for company in companies:
    det_url = base_url + company['href']
    print(det_url)
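If you are looking for companies in a particular postcode, you may prefer to download this dataset rather than scrape: http://download.companieshouse.gov.uk/en_output.html
Companies House also provides an API that you may find useful: https://developer.companieshouse.gov.uk/api/docs/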