Transpose CSV rows and columns during ETL process using Kiba (or plain Ruby)

Question

A third-party system produces an HTML table of parent-teacher bookings:

 Blocks    Teacher 1   Teacher 2   Teacher 3
3:00 pm      Stu A       Stu B
3:10 pm      Stu B                   Stu C
...
5:50 pm      Stu D       Stu A       Stu E

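The number of columns varies with how many teachers have bookings. The number of rows varies with how many slots we create.

The end result needs to be a hash for each teacher, for example: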

{ name: "Teacher 1", email: "teacher.1@school.edu", appointments: [
  { start: "15:00", end: "15:08", attendees: [
    { name: "Stu A Parent 1", email: "stuap1@example.com" },
    { name: "Stu A Parent 2", email: "stuap2@example.com" }
  ] },
  { start: "15:10", end: "15:18", attendees: [
    { name: "Stu B Parent", email: "stubp@example.com" }
  ] },
  ...
  { start: "17:50", end: "17:58", attendees: [
    { name: "Stu D Parent 1", email: "studp1@example.com" },
    { name: "Stu D Parent 2", email: "studp2@example.com" }
  ] },
] },

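I think it makes the most sense for the ETL to process each teacher as one row, so for now I transposed the rows and columns in Numbers and saved the result as CSV: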

Blocks,3:00 pm,3:10 pm,...,5:50 pm
Teacher 1,Stu A,Stu B,...,Stu D
Teacher 2,Stu B,,...,Stu A
Teacher 3,,Stu C,...,Stu E

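I'm trying to keep the whole process as simple as possible for the office staff, so is it possible to transpose the rows and columns in Kiba (or plain Ruby)? In Kiba, I assume I would have to process all the rows, accumulating a hash for each teacher, and then output each teacher's hash at the end?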

Answer 1

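Kiba author here!

I see at least two ways to do this (whether you use plain Ruby or Kiba):

Convert your HTML to a table, then work with that data
Work with the HTML table directly (using Nokogiri and selectors), which is only viable if the HTML is mostly clean

In either case, because you are doing some scraping, I advise you to write very defensive code (the HTML can change and may later contain bugs or corner cases), e.g. strong assertions that the rows and columns contain what you expect, validations, and so on.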
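To make the "strong assertions" advice concrete, here is a minimal sketch of the kind of check I mean. It assumes the scraped table has already been turned into an array of row arrays; validate_table! is a hypothetical helper name, and the expected "Blocks" header and "3:00 pm" style labels are assumptions taken from the sample in the question.

```ruby
# Hypothetical helper: checks a scraped table (array of row arrays)
# before any transposition happens, and raises on the first problem.
def validate_table!(rows)
  raise ArgumentError, "table is empty" if rows.empty?

  header, *body = rows
  unless header.first == "Blocks"
    raise ArgumentError, "unexpected first header cell: #{header.first.inspect}"
  end
  raise ArgumentError, "no teacher columns found" if header.size < 2

  body.each_with_index do |row, i|
    # Every row must have exactly one cell per header column
    unless row.size == header.size
      raise ArgumentError, "row #{i + 1} has #{row.size} cells, expected #{header.size}"
    end
    # Assumption: time labels look like "3:00 pm"
    unless row.first.to_s.match?(/\A\d{1,2}:\d{2} [ap]m\z/i)
      raise ArgumentError, "row #{i + 1} has an unexpected time label: #{row.first.inspect}"
    end
  end

  rows
end

good = [
  ["Blocks", "Teacher 1", "Teacher 2"],
  ["3:00 pm", "Stu A", "Stu B"],
  ["3:10 pm", "Stu B", ""]
]
validate_table!(good) # passes and returns the rows unchanged

bad = [
  ["Blocks", "Teacher 1", "Teacher 2"],
  ["3:00 pm", "Stu A"] # one cell short
]
begin
  validate_table!(bad)
rescue ArgumentError => e
  puts e.message
end
```

Failing loudly here is deliberate: a ragged row would otherwise shift every later booking into the wrong teacher's column.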

If you go with plain Ruby, you can do something like this (here I model your data as comma-separated text to keep things clear):

task :default do
  data = <<DOC
  Blocks  ,  Teacher 1  , Teacher 2  , Teacher 3
  3:00 pm  ,    Stu A   ,    Stu B   ,          
  3:10 pm   ,   Stu B   ,            ,    Stu C
DOC
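  # Split into rows, then cells; the -1 limit keeps trailing empty cells,
  # so every row has the same width and transpose does not raise
  data = data.split("\n").map { |line| line.split(",", -1).map(&:strip) }
  # After transposing, the first row holds the time blocks and each
  # remaining row holds one teacher's column
  blocks, *teachers = data.transpose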
  teachers.each do |teacher|
    pp blocks.zip(teacher)
  end
end

This will output:

[["Blocks", "Teacher 1"], ["3:00 pm", "Stu A"], ["3:10 pm", "Stu B"]]
[["Blocks", "Teacher 2"], ["3:00 pm", "Stu B"], ["3:10 pm", ""]]
[["Blocks", "Teacher 3"], ["3:00 pm", ""], ["3:10 pm", "Stu C"]]

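You can adapt this to your needs (but again: be very defensive and assert everywhere on all data, including the number of cells in the table, etc., otherwise you will end up with off-by-one errors, incorrect schedules, and so on).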
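As one possible adaptation, the sketch below turns a transposed teacher column into the start/end structure from the question, skipping blank cells. The helper names (to_24h_minutes, format_hhmm, teacher_hash) are hypothetical, and the 8-minute meeting length is an assumption read off the 15:00 to 15:08 example in the question.

```ruby
# Hypothetical helper: "3:00 pm" -> 900 (minutes since midnight).
def to_24h_minutes(label)
  m = label.strip.match(/\A(\d{1,2}):(\d{2})\s*([ap])m\z/i)
  raise ArgumentError, "unexpected time label: #{label.inspect}" unless m

  hour = m[1].to_i % 12               # 12 am and 12 pm both start from 0
  hour += 12 if m[3].downcase == "p"
  hour * 60 + m[2].to_i
end

# Hypothetical helper: 900 -> "15:00".
def format_hhmm(mins)
  "%02d:%02d" % mins.divmod(60)
end

# Builds one hash per teacher from the time-block row and a teacher column.
# mins_per_meeting: 8 is an assumption based on the question's example.
def teacher_hash(blocks, column, mins_per_meeting: 8)
  name, *students = column
  _, *labels = blocks # drop the "Blocks" header cell
  appointments = labels.zip(students).filter_map do |label, student|
    next if student.to_s.empty? # blank cell: nobody booked this slot

    start = to_24h_minutes(label)
    { start: format_hhmm(start),
      end: format_hhmm(start + mins_per_meeting),
      student: student }
  end
  { name: name, appointments: appointments }
end

blocks = ["Blocks", "3:00 pm", "3:10 pm"]
teacher_3 = ["Teacher 3", "", "Stu C"]
pp teacher_hash(blocks, teacher_3)
```

For Teacher 3 this yields a single appointment, starting at "15:10" and ending at "15:18", because the 3:00 pm cell is empty. Looking up teacher and parent emails is left out here since that data has to come from somewhere else.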

If you want to use Kiba and CSS selectors, you can do this:

task :default do
  html = <<HTML
    <table>
      <tr>
        <th>Blocks</th>
        <th>Teacher 1</th>
        <th>Teacher 2</th>
        <th>Teacher 3</th>
      </tr>
      <tr>
        <td>3:00 pm</td>
        <td>Stu A</td>
        <td>Stu B</td>
        <td></td>
      </tr>
      <tr>
        <td>3:10 pm</td>
        <td>Stu B</td>
        <td></td>
        <td>Stu C</td>
      </tr>
    </table>
HTML
  require 'nokogiri'
  require 'kiba'
  require 'kiba-common/sources/enumerable'
  require 'kiba-common/transforms/enumerable_exploder'
  Kiba.run do
    # just one doc here, but we could have a sequence instead
    source Kiba::Common::Sources::Enumerable, -> { [html] }

    transform { |r| Nokogiri::HTML(r) }

    transform do |doc|
      Enumerator.new do |y|
        blocks, *teachers = doc.search("table tr:first th").map(&:text)
        # you'd have to add more defensive checks here!!! important!
        teachers.each_with_index do |t, i|
          headers = doc.search("table>tr>:nth-child(1)").map(&:text)
          data = doc.search("table>tr>:nth-child(#{i + 2})").map(&:text)
          y << { teacher: t, data: headers.zip(data) }
        end
      end
    end

    transform Kiba::Common::Transforms::EnumerableExploder

    transform { |r| pp r }
  end
end

This gives:

{:teacher=>"Teacher 1",
 :data=>[["Blocks", "Teacher 1"], ["3:00 pm", "Stu A"], ["3:10 pm", "Stu B"]]}
{:teacher=>"Teacher 2",
 :data=>[["Blocks", "Teacher 2"], ["3:00 pm", "Stu B"], ["3:10 pm", ""]]}
{:teacher=>"Teacher 3",
 :data=>[["Blocks", "Teacher 3"], ["3:00 pm", ""], ["3:10 pm", "Stu C"]]}

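I think I would prefer a mix of the two approaches: first convert the HTML into a proper CSV file or in-memory table, then transpose from there as a second step.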
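For the second step, a rough sketch of how the transposition could live inside a custom source, so the rest of the job only ever sees one row per teacher. TransposedCsvSource is a hypothetical class (it is not part of Kiba or kiba-common); it relies only on the Kiba convention that a source is a class whose each method yields rows. The CSV text is assumed to be the untransposed export with time blocks as rows.

```ruby
require "csv"

# Hypothetical source class, not shipped with Kiba or kiba-common.
class TransposedCsvSource
  def initialize(csv_text)
    @csv_text = csv_text
  end

  def each
    # CSV.parse returns nil for empty cells, so normalise them to ""
    rows = CSV.parse(@csv_text).map { |row| row.map { |cell| cell.to_s.strip } }

    # Defensive check: transpose needs every row to have the same width
    widths = rows.map(&:size).uniq
    raise ArgumentError, "ragged table, row sizes: #{widths.inspect}" unless widths.size == 1

    blocks, *teachers = rows.transpose
    _, *labels = blocks # drop the "Blocks" header cell
    teachers.each do |name, *students|
      yield({ teacher: name, slots: labels.zip(students) })
    end
  end
end

csv_text = <<~DATA
  Blocks,Teacher 1,Teacher 2,Teacher 3
  3:00 pm,Stu A,Stu B,
  3:10 pm,Stu B,,Stu C
DATA

# Outside of Kiba the source can be exercised directly:
TransposedCsvSource.new(csv_text).each { |row| pp row }
```

Inside a job this would be declared as `source TransposedCsvSource, csv_text`. Since the whole table has to be in memory to transpose it anyway, doing it in the source avoids accumulating per-teacher state across transforms.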

Answer 2

Suppose we have the following schedule.

schedule =<<~END
Blocks,15:00,15:10,15:55
Teacher 1,Stu A,Stu B,Stu C
Teacher 2,Stu B,Stu C,Stu A
Teacher 3,Stu C,Stu A,Stu B
END

To produce the desired array of hashes we need additional information. Suppose we are also given the following.

teacher_emails = {
  "Teacher 1"=>"teacher.1@school.edu",
  "Teacher 2"=>"teacher.2@school.edu",
  "Teacher 3"=>"teacher.3@school.edu"
}

parent_emails = {
  "Stu A"=> { "Parent 1"=>"stuap1@example.com",
              "Parent 2"=>"stuap2@example.com" },
  "Stu B"=> { "Parent"=>"stubp@example.com" },
  "Stu C"=> { "Parent 1"=>"stuapc@example.com",
              "Parent 2"=>"stuapc@example.com" }
}

mins_per_meeting = 8

We can then proceed as follows.

blks, *sched = schedule.split(/\n/)
blks
  #=> "Blocks,15:00,15:10,15:55"
sched
  #=> ["Teacher 1,Stu A,Stu B,Stu C",
  #    "Teacher 2,Stu B,Stu C,Stu A",
  #    "Teacher 3,Stu C,Stu A,Stu B"]

time_blocks = blks.scan(/\d{1,2}:\d{2}/).map do |s|
  hr, min = s.split(':')
  mins_from_midnight = 60*(hr.to_i) + min.to_i
  { start: "%d:%02d" % mins_from_midnight.divmod(60),
    end: "%d:%02d" % (mins_from_midnight + mins_per_meeting).divmod(60) }
end
  #=> [{:start=>"15:00", :end=>"15:08"},
  #    {:start=>"15:10", :end=>"15:18"},
  #    {:start=>"15:55", :end=>"16:03"}]

sched.map do |s|
  teacher, *students = s.split(',')
  { name: teacher,
    email: teacher_emails[teacher],
    appointments: time_blocks.zip(students).map do |tb,stud|
      tb.merge(
        { student: stud,
          attendees: parent_emails[stud].map do |par_name, par_email|
            { name: par_name, email: par_email }
          end
        }
      )
    end
  }
end
  #=> [{:name=>"Teacher 1", :email=>"teacher.1@school.edu",
  #     :appointments=>[
  #       {:start=>"15:00", :end=>"15:08",
  #        :student=>"Stu A",
  #        :attendees=>[
  #          {:name=>"Parent 1", :email=>"stuap1@example.com"},
  #          {:name=>"Parent 2", :email=>"stuap2@example.com"}
  #        ]
  #       },
  #       {:start=>"15:10", :end=>"15:18",
  #        :student=>"Stu B",
  #        :attendees=>[
  #          {:name=>"Parent", :email=>"stubp@example.com"}
  #        ]
  #       },
  #       {:start=>"15:55", :end=>"16:03",
  #        :student=>"Stu C",
  #        :attendees=>[
  #          {:name=>"Parent 1", :email=>"stuapc@example.com"},
  #          {:name=>"Parent 2", :email=>"stuapc@example.com"}
  #        ]
  #       }
  #     ]
  #    },

  #    {:name=>"Teacher 2", :email=>"teacher.2@school.edu",
  #     :appointments=>[
  #       {:start=>"15:00", :end=>"15:08",
  #        :student=>"Stu B",
  #        :attendees=>[
  #          {:name=>"Parent", :email=>"stubp@example.com"}
  #        ]
  #       },
  #       ....
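One caveat with the code above: the schedule in the question contains empty slots, and parent_emails[stud] returns nil for a blank or unknown student, so calling map on it would raise NoMethodError. Also note that "Teacher 2,Stu B,".split(',') drops trailing empty fields, so zip pads the missing slots with nil. A minimal sketch of a more forgiving inner loop follows; the sample data is made up to mirror the structures above, and prefixing the student name to the parent name is an assumption based on the "Stu B Parent" format in the question.

```ruby
# Assumed lookup table, mirroring parent_emails above
parent_emails = {
  "Stu B" => { "Parent" => "stubp@example.com" }
}
time_blocks = [
  { start: "15:00", end: "15:08" },
  { start: "15:10", end: "15:18" }
]
students = ["Stu B", ""] # the second slot is unbooked

appointments = time_blocks.zip(students).filter_map do |tb, stud|
  next if stud.nil? || stud.strip.empty? # nobody booked this slot

  # fetch with a block turns an unknown student into a descriptive error
  parents = parent_emails.fetch(stud) do
    raise KeyError, "no parent contacts on file for #{stud.inspect}"
  end

  tb.merge(
    student: stud,
    attendees: parents.map do |par_name, par_email|
      { name: "#{stud} #{par_name}", email: par_email }
    end
  )
end

pp appointments
```

Here only the 15:00 slot survives, with one attendee named "Stu B Parent". Whether an unknown student should raise or just be logged and skipped depends on how much you trust the source data.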
