Stuck with this web scraper
I'm trying to build a program in Python 2.7 with BeautifulSoup that pulls all the profile URLs from this page and the pages that follow:
http://www.reaa.govt.nz/Pages/PublicRegisterSearch.aspx?pageNo=1&name=a*&orgName=&location=&licenceNo=&itemsPerPage=100&sortExpression=2
I've been fighting with this program for a long time and it still doesn't work. I think I've messed up the CSS selector, but I'm not sure what else to try.
Please advise... I'm new to programming and to Python.
import requests
from bs4 import BeautifulSoup
def re_crawler(pages):
    page = 1
    while page <= pages:
        url = 'http://www.reaa.govt.nz/Pages/PublicRegisterSearch.aspx?pageNo=' + str(page) + '&name=a*&orgName=&location=&licenceNo=&itemsPerPage=100&sortExpression=2'
        code = requests.get(url)
        text = code.text
        soup = BeautifulSoup(text)
        for link in soup.select('tr.alternate td a[id*=ct100_]'):
            href = link.get('href')
            print (href)
        page += 1
re_crawler(2)
Try this instead? The fix is in the selector: the ASP.NET control ID prefix is ctl00_ (with a lowercase L), not ct100_ (with a one).
from urllib import urlopen
from bs4 import BeautifulSoup
def re_crawler(pages):
    page = 1
    while page <= pages:
        url = 'http://www.reaa.govt.nz/Pages/PublicRegisterSearch.aspx?pageNo=' + str(page) + '&name=a*&orgName=&location=&licenceNo=&itemsPerPage=100&sortExpression=2'
        code = urlopen(url)
        soup = BeautifulSoup(code)
        # ctl00_ (lowercase L), not ct100_ as in the original selector
        for link in soup.select('tr.alternate td a[id*=ctl00_]'):
            href = link.get('href')
            print (href)
        page += 1
re_crawler(2)
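As a side note, building the query string by hand with string concatenation is fragile. A sketch of the same URL built with the standard library's urlencode instead (the try/except import keeps it working on both Python 2.7 and Python 3; the parameter names are copied straight from the URL above):

```python
try:
    from urllib import urlencode  # Python 2.7
except ImportError:
    from urllib.parse import urlencode  # Python 3

def build_search_url(page):
    """Build the paginated register-search URL for a given page number."""
    base = 'http://www.reaa.govt.nz/Pages/PublicRegisterSearch.aspx'
    # Parameters taken from the original URL; urlencode handles the
    # percent-encoding (note it will encode the * in 'a*' as %2A).
    params = [
        ('pageNo', page),
        ('name', 'a*'),
        ('orgName', ''),
        ('location', ''),
        ('licenceNo', ''),
        ('itemsPerPage', 100),
        ('sortExpression', 2),
    ]
    return base + '?' + urlencode(params)

print(build_search_url(1))
```

The while loop in re_crawler can then just call build_search_url(page), which keeps the pagination logic separate from the query parameters.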