Crawling a list of URLs and skipping those with no DNS

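I am crawling a large list of URLs with Ruby, but not all of the URLs I have are active and associated with a DNS record. When my crawler hits one of these URLs, it raises an error and stops.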

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'net/http'
require 'colorize'

URL_LIST = [
  'http://website.com',
  'http://website.net'
]

URL_LIST.each do |url|
  item = "#{url}"
  resp = Net::HTTP.get_response(URI.parse(item))

  case resp.code.to_i
  when 200
    puts "Success: #{url}".green
  when 301..303
    new_url = resp['location']
    puts "Redirect #{url} => #{new_url}".yellow
  else
    resp.code
  end
end

When I run this script and it reaches a bad URL, I get an error like this:

/Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `initialize': getaddrinfo: nodename nor servname provided, or not known (SocketError)
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `open'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `block in connect'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/timeout.rb:76:in `timeout'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:878:in `connect'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:863:in `do_start'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:852:in `start'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:583:in `start'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:478:in `get_response'
from spider.rb:808:in `block in <main>'
from spider.rb:806:in `each'
from spider.rb:806:in `<main>'

Use a begin/rescue block to rescue the error and print the error message in red:

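require 'net/http'
require 'colorize'

URL_LIST = [
  'http://website.com',
  'http://sdfasdfwqeasdfasdfr.com', # host name that does not resolve
  'http://website.net'
]

URL_LIST.each do |url|
  begin
    resp = Net::HTTP.get_response(URI.parse(url))

    case resp.code.to_i
    when 200
      puts "Success: #{url}".green
    when 301..303
      new_url = resp['location']
      puts "Redirect #{url} => #{new_url}".yellow
    else
      # Print any other status code instead of silently discarding it
      puts "#{resp.code}: #{url}"
    end
  rescue SocketError => e
    # Raised when the host name cannot be resolved; report it and move on
    puts "Error: #{url} - #{e}".red
  end
end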

The output will look like this:

Redirect http://website.com => http://www.website.com/
Error: http://sdfasdfwqeasdfasdfr.com - getaddrinfo: nodename nor servname provided, or not known
Success: http://website.net
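
If you would rather filter out unresolvable hosts before making any HTTP request, a DNS pre-check is another option. The following is only a sketch of that idea, assuming Ruby's standard resolv library is acceptable for the lookup. The resolvable? and partition_by_dns helpers and the resolver: parameter are names made up for this illustration; they are not part of Net::HTTP.

```ruby
require 'resolv'
require 'uri'

# Hypothetical helper (not from the original answer): true when the URL's
# host name resolves. `resolver` is any object that responds to
# getaddress(host) and raises Resolv::ResolvError on failure; it defaults
# to the standard library's Resolv.
def resolvable?(url, resolver: Resolv)
  host = URI.parse(url).host
  return false if host.nil? # e.g. a string with no scheme or host

  resolver.getaddress(host)
  true
rescue Resolv::ResolvError, URI::InvalidURIError
  false
end

# Hypothetical helper: split a list into [resolvable, unresolvable] URLs.
def partition_by_dns(urls, resolver: Resolv)
  urls.partition { |url| resolvable?(url, resolver: resolver) }
end
```

With this in place, something like live, dead = partition_by_dns(URL_LIST) would let the crawl loop run over live only. Resolv does its own lookups in pure Ruby, so its answers may not always agree with the system resolver that Net::HTTP uses, and a host that resolves can still refuse the connection. It is probably safest to keep the rescue block above as well.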