How to parse the HTML contents of a page using Nokogiri
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
doc = Nokogiri::XML(URI.open(url))
I'm trying to get a basic set of information, such as:
event_name
categories
sponsor
venue
event_location
cost
For example, for event_name I have this XPath:
"/html/body/div[2]/div[2]/div[1]/h3/a/span"
and I use it like this:
puts doc.xpath "/html/body/div[2]/div[2]/div[1]/h3/a/span"
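It returns nothing for event_name.
If I save the URL's contents locally, the XPath above works.
I also need the rest of the information listed above. I tried other XPaths too, but the results came back blank.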
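The link you provided contains XML, so your XPath expressions should follow the XML structure, not an HTML one.
The key point is that the document uses namespaces. As far as I understand, every XPath expression has to take that into account and specify the namespace.
To simplify the XPath expressions, you can use the remove_namespaces! method: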
require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
doc = Nokogiri::XML(URI.open(url)); nil # nil is used to avoid huge output
doc.remove_namespaces!; nil
event = doc.xpath('//feed/entry[1]') # it will give you the first event
event.xpath('./title').text # => "Conservation Clinics"
event.xpath('./categories').text # => "Demonstrations,Lectures & Discussions"
Most likely you will want an array of hashes, one per event.
You can build it like this:
doc.xpath('//feed/entry').reduce([]) do |memo, event|
  event_hash = {
    title: event.xpath('./title').text,
    categories: event.xpath('./categories').text
    # all other attributes you need ...
  }
  memo << event_hash
end
It will give you an array like this:
[
{:title=>"Conservation Clinics", :categories=>"Demonstrations,Lectures & Discussions"},
{:title=>"Castle Highlights Tour", :categories=>"Gallery Talks & Tours"},
...
]
Here's how I would do it:
require 'nokogiri'
doc = Nokogiri::XML(open('/Users/gferguson/smithsonian-events.xml'))
namespaces = doc.collect_namespaces
entries = doc.search('entry').map { |entry|
  entry_title = entry.at('title').text
  entry_time_start, entry_time_end = ['startTime', 'endTime'].map { |p|
    entry.at('gd|when', namespaces)[p]
  }
  entry_notes = entry.at('gc|notes', namespaces).text
  {
    title: entry_title,
    start_time: entry_time_start,
    end_time: entry_time_end,
    notes: entry_notes
  }
}
Which, when run, results in entries being an array of hashes:
require 'awesome_print'
ap entries[0, 3]
# >> [
# >> [0] {
# >> :title => "Conservation Clinics",
# >> :start_time => "2016-11-09T14:00:00Z",
# >> :end_time => "2016-11-09T17:00:00Z",
# >> :notes => "Have questions about the condition of a painting, frame, drawing,\n print, or object that you own? Our conservators are available by\n appointment to consult with you about the preservation of your art.\n \n To request an appointment or to learn more,\n e-mail DWRCLunder@si.edu and specify CLINIC in the subject line."
# >> },
# >> [1] {
# >> :title => "Castle Highlights Tour",
# >> :start_time => "2016-11-09T14:00:00Z",
# >> :end_time => "2016-11-09T14:45:00Z",
# >> :notes => "Did you know that the Castle is the Smithsonian’s first and oldest building? Join us as one of our dynamic volunteer docents takes you on a tour to explore the highlights of the Smithsonian Castle. Come learn about the founding and early history of the Smithsonian; its original benefactor, James Smithson; and the incredible history and architecture of the Castle. Here is your opportunity to discover the treasured stories revealed within James Smithson's crypt, the Gre...
# >> },
# >> [2] {
# >> :title => "Exhibition Interpreters/Navigators (throughout the day)",
# >> :start_time => "2016-11-09T15:00:00Z",
# >> :end_time => "2016-11-09T15:00:00Z",
# >> :notes => "Museum volunteer interpreters welcome visitors, answer questions, and help visitors navigate exhibitions. Interpreters may be stationed in several of the following exhibitions at various times throughout the day, subject to volunteer interpreter availability. <ul> \t<li><em>The David H. Koch Hall of Human Origins: What Does it Mean to be Human?</em></li> \t<li><em>The Sant Ocean Hall</em></li> </ul>"
# >> }
# >> ]
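I didn't try to gather the specific information you asked for, because event_name doesn't exist in the document. What you're doing is very generic, though, and is easy once you know a few rules.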
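XML is generally very repetitive because it represents tables of data. The "cells" of the table might vary, but you can use the repetition to help you. In this code, doc.search('entry') finds all the <entry> nodes and map loops over them. It's then easy to look inside each one to find the information needed.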
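XML uses namespaces to help avoid tag-name collisions. At first these seem really hard, but Nokogiri provides the collect_namespaces method for documents, which returns a hash of all the namespaces in the document. If you're looking for a namespaced tag, pass that hash as the second parameter.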
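Nokogiri lets us use both XPath and CSS as selectors. I almost always go with CSS for readability. ns|tag is the format that tells Nokogiri to look for a namespaced tag in a CSS selector. Again, pass it the hash of namespaces from the document and Nokogiri will do the rest.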
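If you're familiar with Nokogiri, you'll see that the code above is very similar to the usual code for pulling the contents of <td> cells out of the <tr> rows of an HTML <table>.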
You should be able to modify that code to gather the data you need without risking namespace collisions.