BeautifulSoup is not able to parse perfectly
When I use soup.find("h3", text="Main Address:").find_parents("section")
the output I get is:
[<section class="otlnrw" itemscope="" itemtype="http://microformats.org/wiki/hCard">\n<header>\n<h3 itemprop="name">Main Address:</h3>\n</header>\n<p>600 Dexter <abbr title="Avenue\r"><abbr title="Avenue\r">Ave.</abbr></abbr><br/><span class="locality">Montgomery</span>, <span class="region">AL</span>, <span class="postal-code">36104</span></p> </section>]
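Now I only want to print the text of the paragraph, and I can't manage it. Please tell me how to print only the text inside the paragraph of this section.
My HTML page looks like this: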
<article>
<header>
<h2 id="state-government">State Government</h2>
</header>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otln">
<header><h3 itemprop="name">Official Name:</h3></header>
<p><a href="http://alaska.gov/">Alaska</a>
</p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otlnrw">
<header><h3 class="org">Governor:</h3></header>
<p><a href="http://gov.alaska.gov/Walker/contact/email-the-governor.html">Bill Walker</a></p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otln">
<header><h3 itemprop="name">Main Address:</h3></header>
<p>120 East 4th Street<br>
<span class="locality">Juneau</span>,
<span class="region">AK</span>,
<span class="postal-code">99801</span></p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otlnrw">
<header><h3 itemprop="name">Phone Number:</h3></header>
<p class="spk tel">907-465-3708</p>
</section>
<p class="volver clearfix"><a href="#skiptarget">
<span class="icon-backtotop-dwnlvl">Back to Top</span></a></p>
<section>
<header><h2 id="state-agencies">State Agencies</h2></header>
<ul>
<li><a href="/state-consumer/alaska">Consumer Protection Offices</a></li>
<li><a href="http://www.correct.state.ak.us/">Corrections Department</a></li>
<li><a href="http://www.elections.alaska.gov/">Election Office</a></li>
<li><a href="http://doa.alaska.gov/dmv/">Motor Vehicle Offices</a></li>
<li><a href="http://doa.alaska.gov/dgs/property/">Surplus Property Sales</a></li>
<li><a href="http://www.travelalaska.com">Travel and Tourism</a></li>
</ul>
</section>
<p class="volver clearfix"><a href="#skiptarget">
<span class="icon-backtotop-dwnlvl">Back to Top</span></a></p>
</article>
How should I get the address as plain text?
Your current code returns a list with a single element. To get the <p>
element inside it, you can extend the expression slightly:
soup.find("h3", text="Main Address:").find_parents("section")[0]("p")
If you want the contents of the p element, take the first element of that list as well, then call decode_contents:
soup.find("h3", text="Main Address:").find_parents("section")[0]("p")[0].decode_contents(formatter="html")
In your case this returns:
u'120 East 4th Street<br/><span class="locality">Juneau</span>, <span class="region">AK</span>, <span class="postal-code">99801</span>'
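Note that decode_contents returns the inner markup, tags included. If the goal is the address as plain text, a minimal sketch along the following lines may help. It assumes a recent Beautiful Soup 4 release (where string= is the newer name for the text= argument), the standard-library "html.parser" backend, and a trimmed copy of the page from the question. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the "Main Address:" section from the question.
html = """
<article>
<section class="otln">
<header><h3 itemprop="name">Main Address:</h3></header>
<p>120 East 4th Street<br>
<span class="locality">Juneau</span>,
<span class="region">AK</span>,
<span class="postal-code">99801</span></p>
</section>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the heading, then walk up to the nearest enclosing <section>.
# find_parent returns a single tag, so no [0] indexing is needed.
heading = soup.find("h3", string="Main Address:")
section = heading.find_parent("section")
paragraph = section.find("p")

# get_text drops the tags; strip=True trims each text fragment and the
# " " separator keeps the fragments from running together.
address = paragraph.get_text(" ", strip=True)

# The commas are separate text fragments, so remove the space before them.
address = address.replace(" ,", ",")

print(address)  # 120 East 4th Street Juneau, AK, 99801
```

find_parent is used here instead of find_parents(...)[0] only because a single enclosing section is wanted; both approaches reach the same element for this page.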