Ruby: sort an array of hashes using another array efficiently, so that processing time is constant

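I have some data that I need to export as CSV. It currently has about 10,000 records and will keep growing, so I want an efficient way to do the iteration, especially since I am running several each loops one after the other. My question is whether there is a way to avoid the many each loops I describe below and, if not, whether there is something other than Ruby's each/map that I can use to keep processing time constant regardless of data size.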

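For instance:

  1. First I loop through the whole data set to flatten and rename the fields that hold array values, so that a field like issue, which holds an array value, becomes issue_1 and issue_2 if the array contains only two items.

  2. Next I do another loop to get all the unique keys in the array of hashes.

  3. Using the unique keys from step 2, I do another loop to sort those keys using a different array that holds the order in which the keys should be arranged.

  4. Finally, another loop generates the CSV.

So I iterate over the data 4 times, using Ruby's each/map every time, and the time to complete these loops increases with the data size.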
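
As a minimal sketch of step 1 only: `flatten_row` is a hypothetical helper name, not something from the code further down, and it assumes scalar values keep their key unchanged while array values are expanded into numbered keys.

```ruby
# Hypothetical helper: flattens a single record.
# Array values become key_1, key_2, ...; scalar values keep their key.
def flatten_row(row)
  row.each_with_object({}) do |(key, value), flat|
    if value.is_a?(Array)
      value.each_with_index { |item, i| flat["#{key}_#{i + 1}"] = item }
    else
      flat[key] = value
    end
  end
end

flat = flatten_row({ "id" => "42", "issue" => ["dec", "ten"] })
# flat is {"id"=>"42", "issue_1"=>"dec", "issue_2"=>"ten"}
```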

The raw data has the following format:

def data
  [
     {"file"=> ["getty_883231284_200013331818843182490_335833.jpg"], "id" => "60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded" => "2019-12-24", "date_modified" => "2019-12-24", "book_title_1"=>"", "title"=> ["haha"], "edition"=> [""], "issue" => ["nov"], "creator" => ["yes", "some"], "publisher"=> ["Library"], "place_of_publication" => "London, UK"},

    {"file" => ["getty_883231284_200013331818843182490_335833.jpg"], "id" => "60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded" => "2019-12-24", "date_modified"=>"2019-12-24", "book_title"=> [""], "title" => ["try"], "edition"=> [""], "issue"=> ["dec", 'ten'], "creator"=> ["tako", "bell", 'big mac'], "publisher"=> ["Library"], "place_of_publication" => "NY, USA"}]
end

The data remapped by flattening the arrays and renaming the keys that hold those arrays:

def csv_data
  @csv_data = [
     {"file_1"=>"getty_883231284_200013331818843182490_335833.jpg", "id"=>"60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded"=>"2019-12-24", "date_modified"=>"2019-12-24", "book_title_1"=>"", "title_1"=>"haha", "edition_1"=>"", "issue_1"=>"nov", "creator_1"=>"yes", "creator_2"=>"some", "publisher_1"=>"Library", "place_of_publication_1"=>"London, UK"},

    {"file_1"=>"getty_883231284_200013331818843182490_335833.jpg", "id"=>"60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded"=>"2019-12-24", "date_modified"=>"2019-12-24", "book_title_1"=>"", "title_1"=>"try", "edition_1"=>"", "issue_1"=>"dec", "issue_2" => 'ten', "creator_1"=>"tako", "creator_2"=>"bell", 'creator_3' => 'big mac', "publisher_1"=>"Library", "place_of_publication_1"=>"NY, USA"}]

end

Sorting the headers of the above data:

def csv_header

  csv_order = ["id", "edition_1", "date_uploaded",  "creator_1", "creator_2", "creator_3", "book_title_1", "publisher_1", "file_1", "place_of_publication_1", "journal_title_1", "issue_1", "issue_2", "date_modified"]

  sorted_header = []
  all_keys = csv_data.flat_map(&:keys).uniq.compact

  # Re-sort by numeric suffix, e.g. creator_isni_1 comes before creator_isni_2
  all_keys = all_keys.sort_by { |name| [name[/\d+/].to_i, name] }

  # For each ordered prefix, append every key that starts with it
  csv_order.each { |k| all_keys.each { |e| sorted_header << e if e.start_with?(k) } }

  sorted_header.uniq
end
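
One way to avoid comparing every entry of the order array against every key is to build a position lookup once and sort with it. The sketch below is only an illustration under assumptions: it uses base names (without the `_1` suffix) in the order array, the sample `all_keys` values are made up, and keys missing from the order array are simply placed last.

```ruby
# Assumed inputs, for illustration only.
csv_order = %w[id edition date_uploaded creator book_title publisher file issue date_modified]
all_keys  = %w[issue_2 creator_1 id file_1 issue_1 creator_2 date_modified unknown_field]

# Build the lookup once: base name => position in the desired order.
rank = csv_order.each_with_index.to_h

sorted_header = all_keys.sort_by do |key|
  base   = key.sub(/_\d+\z/, "")     # "creator_2" -> "creator"
  suffix = key[/_(\d+)\z/, 1].to_i   # "creator_2" -> 2, "id" -> 0
  # Unknown keys get a rank past the end, so they sort last.
  [rank.fetch(base, csv_order.size), suffix, key]
end
# sorted_header is ["id", "creator_1", "creator_2", "file_1",
#                   "issue_1", "issue_2", "date_modified", "unknown_field"]
```

The sort only touches the set of unique keys, which is usually tiny compared with the number of records, so this step is unlikely to dominate the export time.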

Generating the CSV involves yet more loops:

require 'csv'

def to_csv
  # csv_header takes no arguments and reads csv_data itself
  sorted_header = csv_header

  CSV.generate(headers: true) do |csv|
    csv << sorted_header
    csv_data.lazy.each do |hash|
      csv << hash.values_at(*sorted_header)
    end
  end
end

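To be honest, I was more interested in whether I could work out the logic you want without further description than in the programming part alone (though of course I enjoyed that too; it has been ages since I did any Ruby, and this was a good refresher). Since the task is not clearly stated, it had to be "distilled" by reading your description, input data and code.

I think what you should do is keep everything in very basic, lightweight arrays and do the heavy lifting in one single step while reading the data. I also assume that if a key ends with a number, or if a value is an array, you want it returned as {key}_{n}, even if only one value is present.

So far I have come up with this code (the logic is described in the comments) and a repl demo here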

class CustomData
  # @keys array structure
  # 0: Key
  # 1: Maximum amount of values associated
  # 2: Is an array (Found a {key}_n key in feed,
  #    or value in feed was an array)
  #
  # @data: is a simple array of arrays
  attr_accessor :keys, :data
  CSV_ORDER = %w[
    id edition date_uploaded creator book_title publisher
    file place_of_publication journal_title issue date_modified
  ]

  def initialize(feed)
    @keys = CSV_ORDER.map { |key| [key, 0, false]}
    @data = []
    feed.each do |row|
      new_row = []
      # Sort keys in order to maintain the right order for {key}_{n} values
      row.sort_by { |key, _| key }.each do |key, value|
        is_array = false
        if key =~ /_\d+$/
          # If key ends with a number, extract key
          # and remember it is an array for the output
          key, is_array = key[/^(.*)_\d+$/, 1], true
        end
        if value.is_a? Array
          # If value is an array, even if the key did not end with a number,
          # we remember that for the output
          is_array = true
        else
          value = [value]
        end
        # Find position of key if exists or nil
        key_index = @keys.index { |a| a.first == key }
        if key_index
          # If you could have a combination of _n keys and array values
          # for a key in your feed, you need to change this portion here
          # to account for all previous values, which would add some complexity
          #
          # If current amount of values is greater than the saved one, override
          @keys[key_index][1] = value.length if @keys[key_index][1] < value.length
          @keys[key_index][2] = true if is_array and not @keys[key_index][2]
        else
          # It is a new key in @keys array
          key_index = @keys.length
          @keys << [key, value.length, is_array]
        end
        # Add value array at known key index
        # (will be padded with nil if idx is greater than array size)
        new_row[key_index] = value
      end
      @data << new_row
    end
  end

  def to_csv_data(headers=true)
    result, header, body = [], [], []
    if headers
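      @keys.each do |key|
        # Skip ordered keys that never appeared in the feed, so the header
        # stays aligned with the rows (which emit no cells for them)
        next if key[1].zero?
        if key[2]
          # If the key should hold multiple values, build the header string
          key[1].times { |i| header << "#{key[0]}_#{i+1}" }
        else
          # Otherwise it is a singular value and the header goes unmodified
          header << key[0]
        end
      end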
      result << header
    end
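    @data.each do |row|
      new_row = []
      # Walk @keys rather than the row, so rows that lack a key (nil slot)
      # or are shorter than @keys are still padded to the full width
      @keys.each_with_index do |key, index|
        values = row[index] || []
        # Use the value counter from @keys to pad with nil values,
        # if a value is not present
        key[1].times do |count|
          new_row << values[count]
        end
      end
      body << new_row
    end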
    result << body
  end

end
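
For comparison, here is a much smaller standalone sketch of the same idea, not a replacement for the class above. It assumes a simplified, made-up feed whose values are either scalars or non-empty arrays, and a hypothetical `order` array of base key names. Every record still has to be visited at least once, so the total time grows linearly with the data; what can be reduced is the number of passes and the work done per row.

```ruby
require "csv"

# Hypothetical, simplified feed; values are either scalars or arrays.
feed = [
  { "id" => "1", "issue" => ["nov"], "creator" => ["yes", "some"] },
  { "id" => "2", "issue" => ["dec", "ten"], "creator" => ["tako"] }
]
order = %w[id creator issue] # assumed column order

# Pass 1: record the widest array seen per key (0 marks a scalar column).
widths = {}
feed.each do |row|
  row.each do |key, value|
    width = value.is_a?(Array) ? value.size : 0
    widths[key] = [widths.fetch(key, 0), width].max
  end
end

# Expand each ordered key that occurs in the feed into its header cells.
columns = order.select { |key| widths.key?(key) }
header = columns.flat_map do |key|
  widths[key].zero? ? [key] : (1..widths[key]).map { |n| "#{key}_#{n}" }
end

# Pass 2: write the rows, padding short arrays with nil (empty CSV cells).
csv_text = CSV.generate do |csv|
  csv << header
  feed.each do |row|
    cells = columns.flat_map do |key|
      if widths[key].zero?
        [row[key]]
      else
        values = Array(row[key])
        Array.new(widths[key]) { |i| values[i] }
      end
    end
    csv << cells
  end
end

puts csv_text
```

The column widths have to be known before the header can be written, which is why this sketch keeps two passes: one to measure, one to write.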