Nokogiri: get all HTML nodes
I want to get all nodes from an HTML document using Nokogiri.
Example HTML input string:
<html>
<body>
<h1>Test</h1>
<p>test <strong> Jojo </strong></p>
</body>
</html>
Expected output:
['<html>','<body>','<h1>','</h1>','<p>','<strong>','</strong>','</p>','</body>','</html>']
The closing tags and the correct order are important!
I have already tried this code:
require 'nokogiri'
string_page = "<html><body><h1>Header1</h1></body></html>"
doc = Nokogiri::HTML(string_page)
doc.search('*').map(&:name)
# => ["html", "body", "h1"]
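But it does not return the closing tags.
You can split the OuterXml of every non-self-closing opening element on its InnerXml to obtain the opening tag, store the matching closing tag (if there is one) so it can be retrieved later, and parse the document with the Nokogiri reader so that the tags come out in document order.
This requires your document to be a valid XML fragment, because it uses the XML parser rather than the HTML parser.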
require 'nokogiri'
[ "<html><body><h1>Header1</h1></body></html>",
"<html><body><div><h1>Title</h1><hr /></div><div><p>Lorem Ipsum<br />sit <span class=\"style\">d</span>olor</p></div></body></html>", <<END
<html>
<body>
<h1>Test</h1>
<p>test <strong> Jojo </strong></p>
</body>
</html>
END
].each { |string_page|
  elem_all = Array.new    # tags in document order
  elem_ends = Hash.new    # element name => its closing tag
  reader = Nokogiri::XML::Reader(string_page)
  reader.each { |node|
    if node.node_type.eql?(1)    # 1 = opening element
      if node.self_closing?
        elem_all << node.outer_xml
      else
        # outer_xml split on inner_xml leaves the opening tag and the closing tag
        elem_tags = node.outer_xml.split(node.inner_xml)
        elem_all << elem_tags.first
        elem_ends[node.local_name] = elem_tags[1] unless elem_tags.one?
      end
    end
    # 15 = end element: emit the closing tag stored for this name
    elem_all << elem_ends[node.local_name] if node.node_type.eql?(15) and elem_ends.has_key?(node.local_name)
  }
  puts string_page
  puts elem_all.to_s
  puts
}
Output:
<html><body><h1>Header1</h1></body></html>
["<html>", "<body>", "<h1>", "</h1>", "</body>", "</html>"]
<html><body><div><h1>Title</h1><hr /></div><div><p>Lorem Ipsum<br />sit <span class="style">d</span>olor</p></div></body></html>
["<html>", "<body>", "<div>", "<h1>", "</h1>", "<hr/>", "</div>", "<div>", "<p>", "<br/>", "<span class=\"style\">", "</span>", "</p>", "</div>", "</body>", "</html>"]
<html>
<body>
<h1>Test</h1>
<p>test <strong> Jojo </strong></p>
</body>
</html>
["<html>", "<body>", "<h1>", "</h1>", "<p>", "<strong>", "</strong>", "</p>", "</body>", "</html>"]
You can build the closing tags yourself, like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<h1>Test</h1>
<p>test <strong> Jojo </strong></p>
</body>
</html>
EOT
p doc.search('*').map{|m| [m.name, "/#{m.name}"]}
# => [["html", "/html"], ["body", "/body"], ["h1", "/h1"], ["p", "/p"], ["strong", "/strong"]]
You aren't really parsing so much as scanning:
str = <<EOF
<html>
<body>
<h1>Test</h1>
<p>test <strong> Jojo </strong></p>
</body>
</html>
EOF
str.scan(/<.*?>/)
#=> ["<html>", "<body>", "<h1>", "</h1>", "<p>", "<strong>", "</strong>", "</p>", "</body>", "</html>"]
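The pattern `<.*?>` also matches comments and keeps attributes verbatim. If only tag names are wanted, a slightly stricter pattern could be used, as in the sketch below. `scan_tags` is a hypothetical helper name, and the regex is an illustration only: it will still misread attribute values that contain `>`, so it is not a substitute for a real parser.

```ruby
# Hypothetical helper: list tags by name only, ignoring attributes.
# Comments and doctype declarations are skipped because the name
# must start with a letter.
def scan_tags(html)
  tag_pattern = %r{<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*?(/?)>}
  html.scan(tag_pattern).map do |slash, name, self_close|
    "<#{slash}#{name}#{self_close}>"
  end
end

p scan_tags('<div class="a"><p>Hi<br/>there</p><!-- note --></div>')
# => ["<div>", "<p>", "<br/>", "</p>", "</div>"]
```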