How to grab all raw HTML within a certain XPath from a local file in Python
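I'm trying to grab raw HTML from a bunch of local HTML files. I got some help reading the raw files from this post:
Get all text inside a tag lxml
However, my current code produces the whole file rather than a subset. I seem to be missing a line where I can select the XPath I want to grab.
Here is the code I currently have: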
import os
from itertools import chain

from lxml import html
from lxml.etree import tostring


def stringify_children(node):
    # tostring(c) already contains c.text; with_tail=False lets c.tail be added explicitly
    parts = ([node.text] +
             list(chain(*([tostring(c, encoding="unicode", with_tail=False), c.tail]
                          for c in node))) +
             [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))


for filename in os.listdir('../news/article/'):
    if filename.endswith('.html') and not filename.startswith('._'):
        print(filename)
        with open('../news/article/' + filename, "r") as f:
            page = f.read()
        tree = html.fromstring(page)
        # No XPath selection happens yet, so the whole tree is serialized
        maincontent = stringify_children(tree)
        print(maincontent)
My end goal is to get it as a string and write it to a local file that contains only that div.
Here is a sample file:
<html>
<head>
<title>Title</title>
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css">
</head>
<body>
<div class="container">
<div class="row">
<div class="col-xs-4">
<div class="left-bar"></div>
</div>
<div class="col-xs-4">
<div class="middle-bar"></div>
</div>
<div class="col-xs-4">
<div class="right-bar"></div>
</div>
</div>
<div class="row">
<div class="col-xs-3">
<div class="navigation"></div>
</div>
<div class="col-xs-9">
<div class="main-content">
Hello
<br>
<br><a href="http://www.stackexchange.com">Click here to visit stack exchange</a>
<h1>This is an introduction</h1>
<h3>This is the third header</h3>
<p>Lorem ipsum dolor sit amet.....</p>
<p>Lorem ipsum dolor sit amet.....</p>
<p>Lorem ipsum dolor sit amet.....</p>
<ul>
<li>list text</li>
<li>list text</li>
<li>list text</li>
<li>list text</li>
</ul>
<div class="row">
<div class="col-xs-4"><img src="#">More content 1</div>
<div class="col-xs-4"><img src="#">More content 2</div>
<div class="col-xs-4"><img src="#">More content 3</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>
I want to grab everything under the main-content class. Here is the XPath of that class in this file:
XPath: /html/body/div/div[2]/div[2]/div
The program should output the following:
Hello
<br>
<br><a href="http://www.stackexchange.com">Click here to visit stack exchange</a>
<h1>This is an introduction</h1>
<h3>This is the third header</h3>
<p>Lorem ipsum dolor sit amet.....</p>
<p>Lorem ipsum dolor sit amet.....</p>
<p>Lorem ipsum dolor sit amet.....</p>
<ul>
<li>list text</li>
<li>list text</li>
<li>list text</li>
<li>list text</li>
</ul>
<div class="row">
<div class="col-xs-4"><img src="#">More content 1</div>
<div class="col-xs-4"><img src="#">More content 2</div>
<div class="col-xs-4"><img src="#">More content 3</div>
</div>
You could try BeautifulSoup. I'm not really proficient with it, but you can do something like this (or something cleaner, if you read up on BeautifulSoup :)
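from bs4 import BeautifulSoup

# "html.parser" is Python's built-in parser; naming it keeps the result predictable
with open("input.html") as f:
    soup = BeautifulSoup(f, "html.parser")
x = soup.find_all(class_="main-content")
for line in x[0].contents:
    print(line, end=" ")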
You will get output like this:
Hello
<br/>
<br/> <a href="http://www.stackexchange.com">Click here to visit stack exchange</a>
<h1>This is an introduction</h1>
<h3>This is the third header</h3>
<p>Lorem ipsum dolor sit amet.....</p>
<p>Lorem ipsum dolor sit amet.....</p>
<p>Lorem ipsum dolor sit amet.....</p>
<ul>
<li>list text</li>
<li>list text</li>
<li>list text</li>
<li>list text</li>
</ul>
<div class="row">
<div class="col-xs-4"><img src="#"/>More content 1</div>
<div class="col-xs-4"><img src="#"/>More content 2</div>
<div class="col-xs-4"><img src="#"/>More content 3</div>
</div>
BeautifulSoup will "fix" the HTML syntax, such as changing <br> to <br/>, and it preserves the spacing inside elements. See the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
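Since the goal is a single string rather than printed pieces, here is a minimal sketch of the same idea that returns the inner HTML directly. It relies on `Tag.decode_contents()`, which serializes an element's children without the element's own tag. The `inner_html` helper name and the shortened `SAMPLE` markup are made up for illustration; they are not from the question.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample file (hypothetical markup for illustration)
SAMPLE = """<html><body>
<div class="main-content">Hello<br><a href="http://www.stackexchange.com">link</a><p>text</p></div>
</body></html>"""


def inner_html(markup, css_class):
    """Return the inner HTML of the first element with the given class, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    node = soup.find(class_=css_class)
    if node is None:
        return None
    # decode_contents() renders the children only, not the wrapping <div>
    return node.decode_contents()


result = inner_html(SAMPLE, "main-content")
print(result)
```

The returned string can then be written to a file with an ordinary `open(..., "w")` call.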
Using lxml:
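from lxml import html

# h is the HTML source as a string
xm = html.fromstring(h)
div = xm.xpath("//div[@class='main-content']")[0]
# encoding="unicode" makes tostring return str rather than bytes on Python 3
print((div.text or "") + "".join(html.tostring(e, encoding="unicode") for e in div.xpath("./*")))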
Or:
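from lxml import html

xm = html.fromstring(h)
eles = xm.xpath("//div[@class='main-content']/text() | //div[@class='main-content']/*")
# text() already yields the tail text, so with_tail=False avoids emitting it twice
print("".join([ele if isinstance(ele, str) else html.tostring(ele, encoding="unicode", with_tail=False) for ele in eles]))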
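To tie this back to the original goal (loop over local files, select by XPath, write only the div's contents to a file), a rough sketch might look like the following. The function names `inner_html` and `extract_to_files`, the temporary directories, and the tiny demo file are all assumptions made for illustration; the real input would be the asker's `../news/article/` directory.

```python
import os
import tempfile

from lxml import html


def inner_html(element):
    """Serialize an element's contents (leading text plus children) without its own tag."""
    parts = [element.text or ""]
    # tostring() includes each child's tail text by default, so nothing is lost
    parts.extend(html.tostring(child, encoding="unicode") for child in element)
    return "".join(parts)


def extract_to_files(src_dir, dst_dir, xpath):
    """Write the inner HTML of the first XPath match in each .html file into dst_dir."""
    written = []
    for filename in sorted(os.listdir(src_dir)):
        if not filename.endswith(".html") or filename.startswith("._"):
            continue
        tree = html.parse(os.path.join(src_dir, filename))
        matches = tree.xpath(xpath)
        if not matches:
            continue  # skip files that lack the target element
        out_path = os.path.join(dst_dir, filename)
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(inner_html(matches[0]))
        written.append(out_path)
    return written


# Demo with a throwaway file (hypothetical content)
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    with open(os.path.join(src, "a.html"), "w", encoding="utf-8") as f:
        f.write('<html><body><div class="main-content">Hello<br><p>text</p> tail</div></body></html>')
    paths = extract_to_files(src, dst, "//div[@class='main-content']")
    with open(paths[0], encoding="utf-8") as f:
        demo_output = f.read()
print(demo_output)
```

Matching on the class attribute is usually less brittle than the absolute path `/html/body/div/div[2]/div[2]/div`, though either expression can be passed as `xpath`.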