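Crawling a list of URLs and bypassing those with no DNS
I am using Ruby to crawl a large number of URLs, but not all of the URLs I have are active or associated with a DNS record. When my crawler hits one of those URLs, it raises an error and stops.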
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'net/http'
require 'colorize'
URL_LIST = [
  'http://website.com',
  'http://website.net'
]
URL_LIST.each do |url|
  item = "#{url}"
  resp = Net::HTTP.get_response(URI.parse(item))
  case resp.code.to_i
  when 200
    puts "Success: #{url}".green
  when 301..303
    new_url = resp['location']
    puts "Redirect #{url} => #{new_url}".yellow
  else
    resp.code
  end
end
When I run this script and it reaches a URL that does not resolve, I get an error like this:
/Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `initialize': getaddrinfo: nodename nor servname provided, or not known (SocketError)
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `open'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `block in connect'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/timeout.rb:76:in `timeout'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:878:in `connect'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:863:in `do_start'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:852:in `start'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:583:in `start'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:478:in `get_response'
from spider.rb:808:in `block in <main>'
from spider.rb:806:in `each'
from spider.rb:806:in `<main>'
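Use a begin/rescue block to rescue the error and print the error message in red. Because the SocketError is handled inside the loop, the script moves on to the next URL instead of aborting: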
URL_LIST = [
  'http://website.com',
  'http://sdfasdfwqeasdfasdfr.com',
  'http://website.net'
]
URL_LIST.each do |url|
  item = "#{url}"
  begin
    resp = Net::HTTP.get_response(URI.parse(item))
    case resp.code.to_i
    when 200
      puts "Success: #{url}".green
    when 301..303
      new_url = resp['location']
      puts "Redirect #{url} => #{new_url}".yellow
    else
      resp.code
    end
  rescue SocketError => e
    puts "Error: #{url} - #{e}".red
  end
end
The output will look like this:
Redirect http://website.com => http://www.website.com/
Error: http://sdfasdfwqeasdfasdfr.com - getaddrinfo: nodename nor servname provided, or not known
Success: http://website.net
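A dead hostname is only one way a request can fail. A long URL list will probably also contain hosts that refuse the connection, time out, or are malformed. Below is a minimal sketch of the same begin/rescue idea pulled into a helper that returns a status instead of printing. The names `check_url`, `NETWORK_ERRORS`, `DEFAULT_FETCHER` and the `fetcher:` parameter are hypothetical and not part of the original answer, and the list of rescued exceptions is an assumption that you may want to extend.

```ruby
require 'net/http'
require 'uri'
require 'socket'
require 'timeout'

# Hypothetical list of failures to skip over; extend it as needed.
NETWORK_ERRORS = [
  SocketError,          # DNS lookup failed (the error from the question)
  Errno::ECONNREFUSED,  # host resolved but nothing is listening
  Errno::EHOSTUNREACH,  # no route to the host
  Net::OpenTimeout,     # connection could not be opened in time
  Net::ReadTimeout,     # server accepted but did not answer in time
  Timeout::Error
].freeze

# Default way to fetch a URI; injectable so the helper can be
# exercised without touching the network.
DEFAULT_FETCHER = ->(uri) { Net::HTTP.get_response(uri) }

# Returns a [status, detail] pair instead of raising.
def check_url(url, fetcher: DEFAULT_FETCHER)
  resp = fetcher.call(URI.parse(url))
  case resp.code.to_i
  when 200
    [:success, nil]
  when 301..303
    [:redirect, resp['location']]
  else
    [:other, resp.code]
  end
rescue URI::InvalidURIError, *NETWORK_ERRORS => e
  # Any failure listed above is reported and the caller can move on.
  [:error, e.message]
end
```

In the loop from the answer, you could then call `status, detail = check_url(url)` and choose the output colour based on `status`. Passing a lambda as `fetcher:` is only there to make the helper easy to try with stubbed responses.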