Splitting subarrays in a Ruby Nokogiri web scraper

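Hello, I just finished the following tutorials: https://github.com/ryandhaase/Web-Scraper/blob/master/airbnb_scraper.rb and https://medium.com/@tabor_francesca/web-scraper-airbnb-24d67939b08a#.mg7ny2tke. Now I am practicing on my own, and I am having trouble splitting a subarray. Everything else works, but I cannot split the city, state, and zip code into separate Excel columns.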

The line below is incorrect. How do I fix it?

city << [subarray[0], "this is not working", subarray[1]]

I am guessing there is another line that needs to be fixed as well.

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'


url = "https://www.tesla.com/findus/list/stores/United+States"

page = Nokogiri::HTML(open(url))

page = Nokogiri::HTML(open("https://www.tesla.com/findus/list/stores/United+States"))   
puts page.class   

name = []
street_address = []
extended_address = []
city = []
state = []
zip = []


page.css('a.fn.org.url').each do |line|
  name << line.text.strip
end

page.css('span.street-address').each do |line|
  street_address << line.text
end

page.css('span.extended-address').each do |line|
  extended_address << line.text
end

page.css('span.locality').each do |line|
  subarray = line.text.strip.split(/ · /)

  if subarray.length == 3
    city << subarray
  else
    city << [subarray[0], "this is not working", subarray[1]]
  end
end



CSV.open("teslaStores.csv", "w") do |file|
  file << ["Name", "Street Address", "Street Address Continued", "City", "State", "Zip"]

  name.length.times do |i|
    file << [name[i], street_address[i], extended_address[i], city[i], city[i][0], city[i][1]]
  end
end

FYI, this is untested, but it is more idiomatic Ruby:

require 'csv'
require 'nokogiri'
require 'open-uri'

# On Ruby 3.0+ open-uri no longer handles URLs through Kernel#open, so use URI.open.
page = Nokogiri::HTML(URI.open('https://www.tesla.com/findus/list/stores/United+States'))

name = page.css('a.fn.org.url').map{ |n| n.text.strip }
street_address = page.css('span.street-address').map { |n| n.text }
extended_address = page.css('span.extended-address').map{ |n| n.text }

city = page.css('span.locality').map { |n|
  subarray = n.text.strip.split(/ · /)

  if subarray.length == 3
    subarray
  else
    [subarray[0], 'this is not working', subarray[1]]
  end

}

CSV.open('teslaStores.csv', 'w') do |file|
  file << ['Name', 'Street Address', 'Street Address Continued', 'City', 'State', 'Zip']

  name.length.times do |i|
    file << [name[i], street_address[i], extended_address[i], city[i], city[i][0], city[i][1]]
  end
end

It can be reduced even further:

street_address, extended_address = [
  'span.street-address',
  'span.extended-address'
].map{ |selector|
  page.css(selector).map { |n| n.text }
}
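
One detail the snippets above leave unchanged: each element of city is itself an array, so putting city[i] into the CSV row places the whole sub-array in one field rather than three. A minimal sketch of spreading it across columns with the splat operator, assuming each sub-array already holds exactly city, state, and zip (the sample values are made up):

```ruby
# Hypothetical sample data standing in for the scraped arrays.
name           = ['Store A']
street_address = ['123 Main St']
city           = [['Scottsdale', 'AZ', '85251']] # one [city, state, zip] per store

rows = name.each_index.map do |i|
  # *city[i] expands the three-element sub-array into three separate fields.
  [name[i], street_address[i], *city[i]]
end

p rows.first
```

The splat only helps once the split itself produces three parts; it does not fix the separator problem.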

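So, I attended a Python event from meetup.com and asked one of the helpers for an explanation, even though the class was not on this topic :). The instructor explained that I needed to split on the comma and then on the space. Previously I was splitting on a dot (" · ").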
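
To see why the separator matters, here is a small sketch on a made-up locality string. It assumes the page text looks like "City, ST 12345", which is inferred from the fix and not checked against the live page:

```ruby
# Made-up sample; the real page text may differ.
text = 'Scottsdale, AZ 85251'

by_dot   = text.split(/ · /) # no " · " in the text, so it stays in one piece
by_comma = text.split(',')   # city on the left, " AZ 85251" on the right

# split(' ') ignores the leading space and splits on runs of whitespace.
state, zip = by_comma[1].split(' ')

p by_dot
p by_comma
p [by_comma[0], state, zip]
```

Because the dot split returns a single element, the length check against 3 never passes and the else branch always runs.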

I had to change this:

page.css('span.locality').each do |line|
  subarray = line.text.strip.split(/ · /)

  if subarray.length == 3
    city << subarray
  else
    city << [subarray[0], "this is not working", subarray[1]]
  end
end

to this:

page.css('span.locality').each do |line|
  subarray = line.text.strip.split(',')
  subarray2 = subarray[1].split(' ')

  city << subarray[0]
  state << subarray2[0]
  zip << subarray2[1]
end
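
One caveat with the loop above: if an entry has no comma, subarray[1] is nil and calling split on it raises NoMethodError. A slightly more defensive sketch follows. parse_locality is a hypothetical helper name, and it assumes a one-word state abbreviation followed by the zip code:

```ruby
# Hypothetical helper: returns [city, state, zip], with nil for any part
# that is missing, instead of raising on malformed input.
def parse_locality(text)
  # Limit of 2 keeps everything after the first comma together.
  city, rest = text.strip.split(',', 2)
  # rest.to_s turns a missing right-hand side into "" so split still works.
  state, zip = rest.to_s.split(' ')
  [city&.strip, state, zip]
end

p parse_locality('Scottsdale, AZ 85251')
p parse_locality('Scottsdale')
```

A state written as several words would be split incorrectly here, so this is only a starting point.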

The complete answer is below:

require 'nokogiri'
require 'open-uri'
require 'csv'

url = "https://www.tesla.com/findus/list/stores/United+States"

# On Ruby 3.0+ open-uri no longer handles URLs through Kernel#open, so use URI.open.
page = Nokogiri::HTML(URI.open(url))
puts page.class # quick sanity check that the page was parsed

name = []
street_address = []
extended_address = []
city = []
state = []
zip = []

page.css('a.fn.org.url').each do |line|
  name << line.text.strip
end

page.css('span.street-address').each do |line|
  street_address << line.text
end

page.css('span.extended-address').each do |line|
  extended_address << line.text
end

# Assumes the text looks like "City, ST 12345"; an entry without a comma
# would make subarray[1] nil and raise NoMethodError.
page.css('span.locality').each do |line|
  subarray = line.text.strip.split(',')
  subarray2 = subarray[1].split(' ')

  city << subarray[0]
  state << subarray2[0]
  zip << subarray2[1]
end

CSV.open("teslaStores.csv", "w") do |file|
  file << ["Name", "Street Address", "Street Address Continued", "City", "State", "Zip"]

  name.length.times do |i|
    file << [name[i], street_address[i], extended_address[i], city[i], state[i], zip[i]]
  end
end
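
As a possible variation on the index loop, the parallel arrays can be turned into rows with transpose and checked in memory before writing a file. This is only a sketch with made-up sample values; the scraped arrays would take their place:

```ruby
require 'csv'

# Made-up sample columns standing in for the scraped arrays.
name  = ['Store A', 'Store B']
city  = ['Scottsdale', 'Austin']
state = ['AZ', 'TX']
zip   = ['85251', '78758']

# transpose turns four column arrays into one row per store. It raises
# IndexError if the arrays differ in length, which flags misaligned data.
rows = [name, city, state, zip].transpose

# CSV.generate builds the CSV text in memory instead of writing a file.
csv_text = CSV.generate do |csv|
  csv << ['Name', 'City', 'State', 'Zip']
  rows.each { |row| csv << row }
end

puts csv_text
```

The length check is the main reason to prefer this: if one selector matches fewer elements than another, the index loop would silently pair values from different stores.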