Transpose CSV rows and columns during ETL process using Kiba (or plain Ruby)
A third-party system generates an HTML table of parent-teacher bookings:
Blocks Teacher 1 Teacher 2 Teacher 3
3:00 pm Stu A Stu B
3:10 pm Stu B Stu C
...
5:50 pm Stu D Stu A Stu E
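The number of columns varies with how many teachers have bookings. The number of rows varies with how many slots we create.
The final result needs to be one hash per teacher, for example: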
{ name: "Teacher 1", email: "teacher.1@school.edu", appointments: [
{ start: "15:00", end: "15:08", attendees: [
{ name: "Stu A Parent 1", email: "stuap1@example.com" },
{ name: "Stu A Parent 2", email: "stuap2@example.com" }
] },
{ start: "15:10", end: "15:18", attendees: [
{ name: "Stu B Parent", email: "stubp@example.com" }
] },
...
{ start: "17:50", end: "17:58", attendees: [
{ name: "Stu D Parent 1", email: "studp1@example.com" },
{ name: "Stu D Parent 2", email: "studp2@example.com" }
] },
] },
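I figured it makes the most sense for the ETL to process each teacher as a row, so for now I transposed the rows and columns in Numbers and saved the result as CSV: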
Blocks,3:00 pm,3:10 pm,...,5:50 pm
Teacher 1,Stu A,Stu B,...,Stu D
Teacher 2,Stu B,,...,Stu C
Teacher 3,Stu D,Stu A,...,Stu E
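I'm trying to keep the whole process as simple as possible for the office staff, so is it possible to transpose the rows and columns in Kiba (or plain Ruby)? In Kiba, I assume I'd have to process all the rows, accumulate a hash for each teacher, and then output each teacher's hash at the end?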
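Kiba author here!
I see at least two ways of doing this (whether you use plain Ruby or Kiba):
- convert your HTML to a table, then work from that data
- work directly with the HTML table (using Nokogiri and selectors), which is only workable if the HTML is mostly clean
In all cases, because you are doing some scraping, I advise you to write very defensive code (the HTML can change, and it may contain bugs or corner cases later on), e.g. strong assertions that the rows/columns contain what you expect, validations, etc.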
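To make the "defensive code" advice concrete, here is a minimal sketch of the kind of shape checks that could run before any transposing. The helper name `validate_table!`, the expected `Blocks` header cell and the `h:mm am/pm` label pattern are assumptions based on the sample data in the question, not part of Kiba or Nokogiri.

```ruby
# Hypothetical helper: checks the shape of a parsed table before transposing.
# `rows` is an array of arrays of strings, header row first.
def validate_table!(rows)
  raise ArgumentError, "table is empty" if rows.empty?

  header, *body = rows
  unless header.first == "Blocks"
    raise ArgumentError, "expected 'Blocks' in first header cell, got #{header.first.inspect}"
  end
  raise ArgumentError, "no teacher columns found" if header.size < 2

  body.each_with_index do |row, i|
    # Every row must be as wide as the header, otherwise columns shift.
    unless row.size == header.size
      raise ArgumentError, "row #{i + 1} has #{row.size} cells, expected #{header.size}"
    end
    # Assumed label format, e.g. "3:00 pm"; adjust to what the real export contains.
    unless row.first.match?(/\A\d{1,2}:\d{2} [ap]m\z/)
      raise ArgumentError, "row #{i + 1} has an unexpected time label: #{row.first.inspect}"
    end
  end

  rows
end
```

Because it returns `rows` on success, it can be called inline on whatever table you parsed, so a malformed export fails loudly instead of producing a wrong schedule.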
If you go with plain Ruby, you can do something like this (here modeling your data as comma-separated text to keep things clear):
task :default do
data = <<DOC
Blocks , Teacher 1 , Teacher 2 , Teacher 3
3:00 pm , Stu A , Stu B ,
3:10 pm , Stu B , , Stu C
DOC
# limit -1 keeps trailing empty cells, so every row has the same width for transpose
data = data.split("\n").map { |x| x.split(",", -1).map(&:strip) }
blocks, *teachers = data.transpose
teachers.each do |teacher|
pp blocks.zip(teacher)
end
end
This will output:
[["Blocks", "Teacher 1"], ["3:00 pm", "Stu A"], ["3:10 pm", "Stu B"]]
[["Blocks", "Teacher 2"], ["3:00 pm", "Stu B"], ["3:10 pm", ""]]
[["Blocks", "Teacher 3"], ["3:00 pm", ""], ["3:10 pm", "Stu C"]]
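You can adapt this to what you expect (but again: be very defensive and assert everywhere on all the data, including the number of cells in the table etc., or you will end up with off-by-one errors, incorrect schedules and so on).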
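As one possible adaptation, the sketch below turns a list of `[label, value]` pairs (the shape `blocks.zip(teacher)` produces) into a per-teacher hash with 24-hour start and end times. The helper names are hypothetical, the 8-minute meeting length is taken from the example in the question (15:00 to 15:08), and the label format `h:mm am/pm` is assumed from the sample data.

```ruby
# Hypothetical helper: "3:00 pm" -> minutes since midnight (900).
def minutes_from_label(label)
  m = label.match(/\A(\d{1,2}):(\d{2}) ([ap])m\z/)
  raise ArgumentError, "unexpected time label: #{label.inspect}" unless m

  hour = m[1].to_i % 12        # 12 am and 12 pm both start from 0 here
  hour += 12 if m[3] == "p"    # afternoon hours shift by 12
  hour * 60 + m[2].to_i
end

# Formats minutes since midnight as "HH:MM".
def format_minutes(mins)
  "%02d:%02d" % mins.divmod(60)
end

# `pairs` is the zipped column for one teacher; the first pair is the header
# (["Blocks", "Teacher 1"]), the rest are [time label, student] slots.
def appointments_for(pairs, mins_per_meeting: 8)
  (_, teacher), *slots = pairs
  booked = slots.reject { |_, student| student.empty? } # skip unbooked slots

  appointments = booked.map do |label, student|
    start = minutes_from_label(label)
    { start: format_minutes(start),
      end: format_minutes(start + mins_per_meeting),
      student: student }
  end

  { name: teacher, appointments: appointments }
end
```

Emails for teachers and parents are not in the table, so they would still have to be merged in from another source.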
If you want to go with Kiba and CSS selectors, you could do it like this:
task :default do
html = <<HTML
<table>
<tr>
<th>Blocks</th>
<th>Teacher 1</th>
<th>Teacher 2</th>
<th>Teacher 3</th>
</tr>
<tr>
<td>3:00 pm</td>
<td>Stu A</td>
<td>Stu B</td>
<td></td>
</tr>
<tr>
<td>3:10 pm</td>
<td>Stu B</td>
<td></td>
<td>Stu C</td>
</tr>
</table>
HTML
require 'nokogiri'
require 'kiba'
require 'kiba-common/sources/enumerable'
require 'kiba-common/transforms/enumerable_exploder'
Kiba.run do
# just one doc here, but we could have a sequence instead
source Kiba::Common::Sources::Enumerable, -> { [html] }
transform { |r| Nokogiri::HTML(r) }
transform do |doc|
Enumerator.new do |y|
blocks, *teachers = doc.search("table tr:first th").map(&:text)
# you'd have to add more defensive checks here!!! important!
teachers.each_with_index do |t, i|
headers = doc.search("table>tr>:nth-child(1)").map(&:text)
data = doc.search("table>tr>:nth-child(#{i + 2})").map(&:text)
y << { teacher: t, data: headers.zip(data) }
end
end
end
transform Kiba::Common::Transforms::EnumerableExploder
transform { |r| pp r }
end
end
This will give:
{:teacher=>"Teacher 1",
:data=>[["Blocks", "Teacher 1"], ["3:00 pm", "Stu A"], ["3:10 pm", "Stu B"]]}
{:teacher=>"Teacher 2",
:data=>[["Blocks", "Teacher 2"], ["3:00 pm", "Stu B"], ["3:10 pm", ""]]}
{:teacher=>"Teacher 3",
:data=>[["Blocks", "Teacher 3"], ["3:00 pm", ""], ["3:10 pm", "Stu C"]]}
I think I prefer a mix of both approaches: first convert the HTML into a proper CSV file or an in-memory table, then transpose from there in a second step.
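The second step of that mixed approach could look roughly like the sketch below, assuming the HTML table has already been exported to CSV text with one row per time block. The method name `teachers_from_csv` is hypothetical; `CSV.parse` and `Array#transpose` are standard Ruby. Note that `CSV.parse` returns `nil` for empty cells.

```ruby
require "csv"

# Hypothetical helper: parses the exported CSV and returns
# { "Teacher 1" => { "3:00 pm" => "Stu A", ... }, ... }.
def teachers_from_csv(csv_text)
  table = CSV.parse(csv_text)

  # Fail early on ragged rows; Array#transpose would raise IndexError anyway,
  # but this message is easier to act on.
  widths = table.map(&:size).uniq
  raise ArgumentError, "ragged table, row widths: #{widths.inspect}" unless widths.size == 1

  # After transposing, the first row holds the time labels and each
  # remaining row is one teacher followed by that teacher's students.
  blocks, *teachers = table.transpose
  times = blocks.drop(1) # drop the "Blocks" header cell

  teachers.to_h { |name, *students| [name, times.zip(students).to_h] }
end

csv_text = <<~CSV
  Blocks,Teacher 1,Teacher 2,Teacher 3
  3:00 pm,Stu A,Stu B,
  3:10 pm,Stu B,,Stu C
CSV

pp teachers_from_csv(csv_text)
```

Keeping extraction and transposition separate means the intermediate CSV can also be inspected by hand when a schedule looks wrong.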
Suppose we have the following schedule.
schedule =<<~END
Blocks,15:00,15:10,15:55
Teacher 1,Stu A,Stu B,Stu C
Teacher 2,Stu B,Stu C,Stu A
Teacher 3,Stu C,Stu A,Stu B
END
To produce the desired array of hashes we need additional information. Suppose we are also given the following.
teacher_emails = {
"Teacher 1"=>"teacher.1@school.edu",
"Teacher 2"=>"teacher.2@school.edu",
"Teacher 3"=>"teacher.3@school.edu"
}
parent_emails = {
"Stu A"=> { "Parent 1"=>"stuap1@example.com",
"Parent 2"=>"stuap2@example.com" },
"Stu B"=> { "Parent"=>"stubp@example.com" },
"Stu C"=> { "Parent 1"=>"stuapc@example.com",
"Parent 2"=>"stuapc@example.com" }
}
mins_per_meeting = 8
We may then proceed as follows.
blks, *sched = schedule.split(/\n/)
blks
#=> "Blocks,15:00,15:10,15:55"
sched
#=> ["Teacher 1,Stu A,Stu B,Stu C",
# "Teacher 2,Stu B,Stu C,Stu A",
# "Teacher 3,Stu C,Stu A,Stu B"]
time_blocks = blks.scan(/\d{1,2}:\d{2}/).map do |s|
  hr, min = s.split(':')
  mins_from_midnight = 60*(hr.to_i) + min.to_i
  { start: "%d:%02d" % mins_from_midnight.divmod(60),
    end: "%d:%02d" % (mins_from_midnight + mins_per_meeting).divmod(60) }
end
#=> [{:start=>"15:00", :end=>"15:08"},
#    {:start=>"15:10", :end=>"15:18"},
#    {:start=>"15:55", :end=>"16:03"}]
sched.map do |s|
teacher, *students = s.split(',')
{ name: teacher,
email: teacher_emails[teacher],
appointments: time_blocks.zip(students).map do |tb,stud|
tb.merge(
{ student: stud,
attendees: parent_emails[stud].map do |par_name, par_email|
{ name: par_name, email: par_email }
end
}
)
end
}
end
#=> [{:name=>"Teacher 1", :email=>"teacher.1@school.edu",
# :appointments=>[
# {:start=>"15:00", :end=>"15:08",
# :student=>"Stu A",
# :attendees=>[
# {:name=>"Parent 1", :email=>"stuap1@example.com"},
# {:name=>"Parent 2", :email=>"stuap2@example.com"}
# ]
# },
# {:start=>"15:10", :end=>"15:18",
# :student=>"Stu B",
# :attendees=>[
# {:name=>"Parent", :email=>"stubp@example.com"}
# ]
# },
# {:start=>"15:55", :end=>"16:03",
# :student=>"Stu C",
# :attendees=>[
# {:name=>"Parent 1", :email=>"stuapc@example.com"},
# {:name=>"Parent 2", :email=>"stuapc@example.com"}
# ]
# }
# ]
# },
# {:name=>"Teacher 2", :email=>"teacher.2@school.edu",
# :appointments=>[
# {:start=>"15:00", :end=>"15:08",
# :student=>"Stu B",
# :attendees=>[
# {:name=>"Parent", :email=>"stubp@example.com"}
# ]
# },
# ....
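The schedule above has every slot booked, whereas the table in the question contains empty cells, for which `parent_emails[stud]` would be `nil`. A minimal sketch of one way to handle that follows; the helper names are hypothetical, and prefixing the student name to each attendee name is an assumption made to match the format shown in the question ("Stu A Parent 1").

```ruby
# Hypothetical helper: builds one appointment, or nil for an unbooked slot.
def build_appointment(time_block, student, parent_emails)
  return nil if student.nil? || student.strip.empty?

  # Fail loudly on a student we have no contact data for.
  parents = parent_emails.fetch(student) {
    raise KeyError, "no parent emails on file for #{student.inspect}"
  }

  time_block.merge(
    student: student,
    attendees: parents.map { |name, email| { name: "#{student} #{name}", email: email } }
  )
end

# Pairs each time block with its student cell and drops the empty slots.
def appointments(time_blocks, students, parent_emails)
  unless time_blocks.size == students.size
    raise ArgumentError, "#{time_blocks.size} time blocks but #{students.size} student cells"
  end

  time_blocks.zip(students).filter_map do |tb, stud|
    build_appointment(tb, stud, parent_emails)
  end
end
```

When splitting a CSV line by hand, `s.split(',', -1)` keeps trailing empty cells, so the size check above does not trip on a teacher whose last slots are free.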