Ruby: sort the keys of an array of hashes using another array, efficiently enough that processing time stays flat as the data grows
I have some data that I need to export as CSV. It is currently about 10,000 records and will keep growing, so I want an efficient way to do the iteration, especially with regard to running several loops one after another.
My question is whether there is a way to avoid the many each loops I describe below, and if not, whether there is something other than Ruby's each/map that I can use to keep the processing time flat regardless of the data size.
For instance:
First I will loop over the entire data set to flatten and rename the fields that hold array values, so that a field like issue that holds array values will come out as issue_1 and issue_2 if it holds just two items in the array.
Next I will run another loop to get all the unique keys in the array of hashes.
Using the unique keys from step 2, I will run another loop to sort those keys using a separate array that holds the desired order of the keys.
Finally, another loop to generate the CSV.
So I iterate over the data 4 times using Ruby's each/map, and the time to finish these loops grows with the size of the data.
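As an aside, the first two passes (flattening array values into suffixed keys, and collecting the set of output keys) can in principle be merged into a single iteration. A minimal sketch of that idea — the method name flatten_rows is mine, not from the question:

```ruby
require "set"

# Flatten array values into suffixed keys ("issue" => ["nov", "dec"]
# becomes "issue_1"/"issue_2") and collect every output key in the
# same pass over the data, merging steps 1 and 2 into one loop.
def flatten_rows(rows)
  seen_keys = Set.new
  flattened = rows.map do |row|
    row.each_with_object({}) do |(key, value), out|
      Array(value).each_with_index do |v, i|
        new_key = value.is_a?(Array) ? "#{key}_#{i + 1}" : key
        out[new_key] = v
        seen_keys << new_key
      end
    end
  end
  [flattened, seen_keys.to_a]
end
```

Note that a constant number of O(n) passes is still O(n) overall, so merging passes reduces the constant factor, not the growth rate; truly constant time regardless of data size is not achievable for a full export.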
The original data is in the following format:
def data
  [
    {"file" => ["getty_883231284_200013331818843182490_335833.jpg"], "id" => "60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded" => "2019-12-24", "date_modified" => "2019-12-24", "book_title_1" => "", "title" => ["haha"], "edition" => [""], "issue" => ["nov"], "creator" => ["yes", "some"], "publisher" => ["Library"], "place_of_publication" => "London, UK"},
    {"file" => ["getty_883231284_200013331818843182490_335833.jpg"], "id" => "60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded" => "2019-12-24", "date_modified" => "2019-12-24", "book_title" => [""], "title" => ["try"], "edition" => [""], "issue" => ["dec", "ten"], "creator" => ["tako", "bell", "big mac"], "publisher" => ["Library"], "place_of_publication" => "NY, USA"}
  ]
end
Remapping the data by flattening the arrays and renaming the keys that hold those arrays:
def csv_data
  @csv_data = [
    {"file_1" => "getty_883231284_200013331818843182490_335833.jpg", "id" => "60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded" => "2019-12-24", "date_modified" => "2019-12-24", "book_title_1" => "", "title_1" => "haha", "edition_1" => "", "issue_1" => "nov", "creator_1" => "yes", "creator_2" => "some", "publisher_1" => "Library", "place_of_publication_1" => "London, UK"},
    {"file_1" => "getty_883231284_200013331818843182490_335833.jpg", "id" => "60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded" => "2019-12-24", "date_modified" => "2019-12-24", "book_title_1" => "", "title_1" => "try", "edition_1" => "", "issue_1" => "dec", "issue_2" => "ten", "creator_1" => "tako", "creator_2" => "bell", "creator_3" => "big mac", "publisher_1" => "Library", "place_of_publication_1" => "NY, USA"}
  ]
end
Sorting the headers for the above data:
def csv_header
  csv_order = ["id", "edition_1", "date_uploaded", "creator_1", "creator_2", "creator_3", "book_title_1", "publisher_1", "file_1", "place_of_publication_1", "journal_title_1", "issue_1", "issue_2", "date_modified"]

  sorted_header = []
  all_keys = csv_data.flat_map(&:keys).uniq.compact
  # re-sort so that ordering by suffix works, e.g. creator_isni_1 comes before creator_isni_2
  all_keys = all_keys.sort_by { |name| [name[/\d+/].to_i, name] }
  csv_order.each { |k| all_keys.each { |e| sorted_header << e if e.start_with?(k) } }
  sorted_header.uniq
end
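The nested loops over csv_order and all_keys can be replaced by a precomputed position map, so each key is placed with a single hash lookup instead of a scan over the order array. A sketch of the idea, using shortened key lists for illustration:

```ruby
# Order keys by the base name's position in csv_order, then by the
# numeric suffix, using an O(1) hash lookup instead of nested scans.
csv_order = ["id", "creator", "issue"]
position  = csv_order.each_with_index.to_h

all_keys = ["issue_2", "creator_1", "id", "issue_1"]
sorted = all_keys.sort_by do |key|
  base = key.sub(/_\d+$/, "")              # strip the _n suffix
  [position.fetch(base, csv_order.length), # unknown bases sort last
   key[/\d+$/].to_i]                       # then by numeric suffix
end
# sorted => ["id", "creator_1", "issue_1", "issue_2"]
```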
Generating the CSV also involves more loops:
require "csv"

def to_csv
  sorted_header = csv_header
  CSV.generate(headers: true) do |csv|
    csv << sorted_header
    csv_data.each do |hash|
      csv << hash.values_at(*sorted_header)
    end
  end
end
To be honest, I was more interested in whether I could figure out the logic you want without further description than in the programming part alone (although of course I enjoyed that too; it has been ages since I last did some Ruby, and this was a nice refresher). Since the task is not explicitly specified, it had to be "distilled" by reading your description, input data and code.
I think what you should do is keep everything in very basic and lightweight arrays, and do the heavy lifting while reading the data in, step by step.
I also assume that if a key ends with a number, or if a value is an array, you want it returned as {key}_{n}, even if only one value is present.
This is the code I have come up with so far (the logic is described in the comments) and a repl demo here
class CustomData
  # @keys array structure
  #   0: Key
  #   1: Maximum amount of values associated
  #   2: Is an array (found a {key}_n key in feed,
  #      or value in feed was an array)
  #
  # @data is a simple array of arrays
  attr_accessor :keys, :data

  CSV_ORDER = %w[
    id edition date_uploaded creator book_title publisher
    file place_of_publication journal_title issue date_modified
  ]

  def initialize(feed)
    @keys = CSV_ORDER.map { |key| [key, 0, false] }
    @data = []

    feed.each do |row|
      new_row = []
      # Sort keys in order to maintain the right order for {key}_{n} values
      row.sort_by { |key, _| key }.each do |key, value|
        is_array = false
        if key =~ /_\d+$/
          # If the key ends with a number, extract the base key
          # and remember it is an array for the output
          key, is_array = key[/^(.*)_\d+$/, 1], true
        end
        if value.is_a? Array
          # If the value is an array, even if the key did not end with
          # a number, we remember that for the output
          is_array = true
        else
          value = [value]
        end
        # Find the position of the key if it exists, or nil
        key_index = @keys.index { |a| a.first == key }
        if key_index
          # If you could have a combination of _n keys and array values
          # for a key in your feed, you need to change this portion here
          # to account for all previous values, which would add some complexity
          #
          # If the current amount of values is greater than the saved one, override
          @keys[key_index][1] = value.length if @keys[key_index][1] < value.length
          @keys[key_index][2] = true if is_array && !@keys[key_index][2]
        else
          # It is a new key in the @keys array
          key_index = @keys.length
          @keys << [key, value.length, is_array]
        end
        # Add the value array at the known key index
        # (will be padded with nil if the index is greater than the array size)
        new_row[key_index] = value
      end
      @data << new_row
    end
  end

  def to_csv_data(headers = true)
    result, header, body = [], [], []

    if headers
      @keys.each do |key|
        if key[2]
          # If the key should hold multiple values, build the header strings
          key[1].times { |i| header << "#{key[0]}_#{i + 1}" }
        else
          # Otherwise it is a singular value and the header goes unmodified
          header << key[0]
        end
      end
      result << header
    end

    @data.each do |row|
      new_row = []
      row.each_with_index do |value, index|
        # Use the value counter from @keys to pad with nil values
        # if a value is not present
        @keys[index][1].times do |count|
          new_row << value[count]
        end
      end
      body << new_row
    end
    result << body
  end
end
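For completeness, the [header, rows] pair that to_csv_data returns can be fed to Ruby's CSV library like this (a small helper of my own, not part of the class above):

```ruby
require "csv"

# Turn a [header, rows] pair, like the one to_csv_data returns,
# into a CSV string.
def rows_to_csv(header, rows)
  CSV.generate do |csv|
    csv << header
    rows.each { |row| csv << row }
  end
end
```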