BeautifulSoup not able to parse perfectly

Question

When I use soup.find("h3", text="Main Address:").find_parents("section"), I get this output:

[<section class="otlnrw" itemscope="" itemtype="http://microformats.org/wiki/hCard">\n<header>\n<h3 itemprop="name">Main Address:</h3>\n</header>\n<p>600 Dexter <abbr title="Avenue\r"><abbr title="Avenue\r">Ave.</abbr></abbr><br/><span class="locality">Montgomery</span>, <span class="region">AL</span>, <span class="postal-code">36104</span></p> </section>]

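Now I want to print only the text of the paragraph, and I can't manage it. Please tell me how to print only the text inside the paragraph of this section.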

Alternatively, my HTML page looks like this:

<article>
<header>
    <h2 id="state-government">State Government</h2>
</header>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otln">
    <header><h3  itemprop="name">Official Name:</h3></header>
    <p><a href="http://alaska.gov/">Alaska</a>
    </p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otlnrw">
    <header><h3  class="org">Governor:</h3></header>
    <p><a href="http://gov.alaska.gov/Walker/contact/email-the-governor.html">Bill Walker</a></p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otln">
    <header><h3  itemprop="name">Main Address:</h3></header>
    <p>120 East 4th Street<br>
        <span class="locality">Juneau</span>, 
        <span class="region">AK</span>, 
        <span class="postal-code">99801</span></p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otlnrw">
    <header><h3  itemprop="name">Phone Number:</h3></header>
    <p class="spk tel">907-465-3708</p>
</section>
<p class="volver clearfix"><a href="#skiptarget">
    <span class="icon-backtotop-dwnlvl">Back to Top</span></a></p>
<section>
    <header><h2 id="state-agencies">State Agencies</h2></header>
    <ul>
        <li><a href="/state-consumer/alaska">Consumer Protection Offices</a></li>
        <li><a href="http://www.correct.state.ak.us/">Corrections Department</a></li>
        <li><a href="http://www.elections.alaska.gov/">Election Office</a></li>
        <li><a href="http://doa.alaska.gov/dmv/">Motor Vehicle Offices</a></li>
        <li><a href="http://doa.alaska.gov/dgs/property/">Surplus Property Sales</a></li>
        <li><a href="http://www.travelalaska.com">Travel and Tourism</a></li>
    </ul>
</section>
<p class="volver clearfix"><a href="#skiptarget">
    <span class="icon-backtotop-dwnlvl">Back to Top</span></a></p>
</article>

How should I get the address as plain text?

Answer 1

Your current code returns a list with only one element. To get the <p> elements inside it, extend the expression slightly:

soup.find("h3", text="Main Address:").find_parents("section")[0]("p")
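
A note on the shorthand in that line, offered as a hedged sketch rather than the answerer's own code: calling a tag like a function is Beautiful Soup's shortcut for find_all, and find_parent returns the nearest matching ancestor directly instead of a one-element list. The snippet assumes bs4 is installed, uses the built-in "html.parser", and works on a trimmed copy of the question's markup (the SNIPPET name is illustrative). It also uses string=, the newer name for the text= argument.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup; SNIPPET is an illustrative name.
SNIPPET = """
<section class="otln">
    <header><h3 itemprop="name">Main Address:</h3></header>
    <p>120 East 4th Street</p>
</section>
"""

soup = BeautifulSoup(SNIPPET, "html.parser")
heading = soup.find("h3", string="Main Address:")

# find_parents returns a list of matching ancestors, nearest first;
# find_parent returns only the nearest one.
section_from_list = heading.find_parents("section")[0]
section = heading.find_parent("section")

# Calling a tag is shorthand for find_all on that tag.
paragraphs_short = section("p")
paragraphs_long = section.find_all("p")

print(section is section_from_list)
print(paragraphs_short == paragraphs_long)
```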

If you want the contents of the p element, you again have to take the first element of that list and then call decode_contents:

soup.find("h3", text="Main Address:").find_parents("section")[0]("p")[0].decode_contents(formatter="html")

In your case this returns:

u'120 East 4th Street<br/><span class="locality">Juneau</span>, <span class="region">AK</span>, <span class="postal-code">99801</span>'
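
Note that decode_contents returns the inner HTML, tags included. If the goal is the address as plain text, one possible approach is get_text with the whitespace collapsed. The following is a minimal sketch under stated assumptions, not the answerer's method: bs4 is installed, the built-in "html.parser" is used, and extract_address is a hypothetical helper name. Replacing each <br> with a newline first is a choice made here so that text on either side of the break is not glued together.

```python
from bs4 import BeautifulSoup

# Copy of the "Main Address" section from the question's page.
HTML = """
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otln">
    <header><h3  itemprop="name">Main Address:</h3></header>
    <p>120 East 4th Street<br>
        <span class="locality">Juneau</span>,
        <span class="region">AK</span>,
        <span class="postal-code">99801</span></p>
</section>
"""


def extract_address(html, label="Main Address:"):
    """Return the text of the <p> in the section whose <h3> equals label.

    Hypothetical helper for illustration; returns None if nothing matches.
    """
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h3", string=label)
    if heading is None:
        return None
    section = heading.find_parent("section")
    paragraph = section.find("p") if section is not None else None
    if paragraph is None:
        return None
    # Turn line breaks into whitespace so adjacent text stays separated.
    for br in paragraph.find_all("br"):
        br.replace_with("\n")
    # get_text drops the tags; split/join collapses runs of whitespace.
    return " ".join(paragraph.get_text().split())


address = extract_address(HTML)
print(address)
```

Without the <br> replacement, markup like the first output in the question, where <br/> is followed immediately by a <span>, would produce "Ave.Montgomery" with no space between the two parts.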

beautifulsoup

python-2.7