Nokogiri iterating over tr tags too many times
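I am scraping this page, https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=Duhig, and for each tr I collect and return the level name and the number of available computers.
The problem is that the loop runs too many times. There are only 4 tr tags, but the loop goes through 5 iterations, which appends an extra nil to the returned array. Why is that?
The scraped part: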
<table class="chart">
  <tr valign="middle">
    <td class="left"><a href="availablepcsembed.php?branch=Duhig&room=Lvl1">Level 1</a></td>
    <td class="middle"><div style="width:68%;"><strong>68%</strong></div></td>
    <td class="right">23 Free of 34 PC's</td>
  </tr>
  <tr valign="middle">
    <td class="left"><a href="availablepcsembed.php?branch=Duhig&room=Lvl2">Level 2</a></td>
    <td class="middle"><div style="width:78%;"><strong>78%</strong></div></td>
    <td class="right">83 Free of 107 PC's</td>
  </tr>
  <tr valign="middle">
    <td class="left"><a href="availablepcsembed.php?branch=Duhig&room=Lvl4">Level 4</a></td>
    <td class="middle"><div style="width:64%;"><strong>64%</strong></div></td>
    <td class="right">9 Free of 14 PC's</td>
  </tr>
  <tr valign="middle">
    <td class="left"><a href="availablepcsembed.php?branch=Duhig&room=Lvl5">Level 5</a></td>
    <td class="middle"><div style="width:97%;"><strong>97%</strong></div></td>
    <td class="right">28 Free of 29 PC's</td>
  </tr>
</table>
Simplified method:
def self.scrape_details_page(library_url)
  details_page = Nokogiri::HTML(open(library_url))
  library_name = details_page.css("h3")
  details_page.css("table tr").collect do |level|
    case level.css("a[href]").text.downcase
    when "level 1"
      name = level.css("a[href]").text
      total_available = level.css(".right").text.split(" ")[0]
      out_of_available = level.css(".right").text.split(" ")[3]
      level = {name: name, total_available: total_available, out_of_available: out_of_available}
    when "level 2"
      name = level.css("a[href]").text
      total_available = level.css(".right").text.split(" ")[0]
      out_of_available = level.css(".right").text.split(" ")[3]
      level = {name: name, total_available: total_available, out_of_available: out_of_available}
    end
  end
end
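You can specify the table's class attribute and then select only the tr tags inside it. That way you avoid the "additional" tr, which presumably belongs to another table on the page that the broader "table tr" selector also matches. For example: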
details_page.css("table.chart tr").map do |level|
...
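As for why the surplus element shows up as nil: a case expression with no matching when and no else evaluates to nil, and collect/map keeps exactly one result per element. Here is a minimal sketch of that behaviour, using plain strings as hypothetical stand-ins for the link text of each matched row (the empty string stands in for a row that has no level link):

```ruby
# Hypothetical link texts, one per matched <tr>; the last row has no link.
link_texts = ["Level 1", "Level 2", ""]

rows = link_texts.map do |text|
  case text.downcase
  when "level 1", "level 2"
    { name: text }
  end # no else branch, so an unmatched row evaluates to nil
end

p rows

# compact drops the nil entries when the rows cannot be narrowed by selector.
p rows.compact
```

Narrowing the selector is still the cleaner fix, since compact only hides the extra row instead of excluding it.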
And simplify the scrape_details_page method a bit:
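require 'nokogiri'
require 'open-uri'

def scrape_details_page(library_url)
  # URI.open comes from open-uri; a bare Kernel#open no longer fetches URLs on Ruby 3.0+.
  details_page = Nokogiri::HTML(URI.open(library_url))
  details_page.css('table.chart tr').map do |level|
    # e.g. "23 Free of 34 PC's" => ["23", "Free", "of", "34", "PC's"]
    right = level.css('.right').text.split
    { name: level.css('a[href]').text, total_available: right[0], out_of_available: right[3] }
  end
end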
p scrape_details_page('https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=Duhig')
# [{:name=>"Level 1", :total_available=>"22", :out_of_available=>"34"},
# {:name=>"Level 2", :total_available=>"98", :out_of_available=>"107"},
# {:name=>"Level 4", :total_available=>"12", :out_of_available=>"14"},
# {:name=>"Level 5", :total_available=>"26", :out_of_available=>"29"}]