Extracting the text between two header tags using BeautifulSoup in Python

I am trying to extract a movie's plot from its Wikipedia page using BeautifulSoup in Python. I am new to both Python and BeautifulSoup, so I am not sure how to go about it.

Here is the input HTML:

<h2><span class="mw-headline" id="Plot">Plot</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Moana_(2016_film)&amp;action=edit&amp;section=1" title="Edit section: Plot">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<p>A small <a href="/wiki/Pounamu" title="Pounamu">pounamu</a> stone that is the mystical heart of the island <a href="/wiki/Goddess" title="Goddess">goddess</a> Te Fiti is stolen by the <a href="/wiki/Demigod" title="Demigod">demigod</a> <a href="/wiki/M%C4%81ui_(mythology)" title="Māui (mythology)">Maui</a>, who was planning to give it to humanity as a gift. As Maui makes his escape, he is attacked by the lava <a href="/wiki/Demon" title="Demon">demon</a> Te Kā, causing the heart of Te Fiti as well as his power-granting magical fish hook to be lost in the ocean.</p>
<p>A millennium later, young Moana Waialiki, daughter and heir of the chief on the small <a href="/wiki/Polynesia" title="Polynesia">Polynesian</a> island of Motunui, is chosen by the ocean to receive the heart, but drops it when her father, Chief Tui, comes to get her. He insists the island provides everything the villagers need. But years later, fish become scarce and the island's vegetation begins dying. Moana proposes going beyond the reef to find more fish. Tui rejects her request, as sailing past the reef is forbidden.</p>
<p>Moana's grandmother Tala shows Moana a secret cave behind a waterfall, where she finds boats inside and discovers her ancestors were voyagers, sailing and discovering new islands across the world. Tala explains that they stopped voyaging because Maui stole the heart of Te Fiti, causing Te Kā and monsters to appear in the ocean. Tala then says Te Kā's darkness has been spreading from island to island, slowly killing them. Tala gives Moana the heart of Te Fiti, which she has kept safe for her granddaughter.</p>
<p>Tala falls ill and with her dying breaths tells Moana to set sail. Moana and her pet <a href="/wiki/Rooster" title="Rooster">rooster</a> Heihei depart in a <a href="/wiki/Drua" title="Drua">drua</a> to find Maui. A <a href="/wiki/Manta_ray" title="Manta ray">manta ray</a>, Tala's reincarnation, follows. After a <a href="/wiki/Typhoon" title="Typhoon">typhoon</a> wave flips her sailboat and knocks her unconscious, she awakens the next morning on an island inhabited by Maui, who traps her in a cave and takes her sailboat to search for his fishhook. After escaping and catching up to Maui, Moana tries to convince him to return the heart, but Maui refuses, fearing its power will attract dark creatures.</p>
<p>Sentient coconut pirates called Kakamora surround the boat and steal the heart, but Maui and Moana retrieve it. Maui agrees to help return the heart, but only after he reclaims his hook, which is hidden in Lalotai, the Realm of Monsters. At Lalotai, they retrieve it by tricking Tamatoa, a giant <a href="/wiki/Coconut_crab" title="Coconut crab">coconut crab</a>. Maui teaches Moana how to properly sail and navigate. They arrive at Te Fiti, where Te Kā attacks. Maui is overpowered and Te Kā severely damages his hook and repels their boat far out to sea. Fearful that returning to fight Te Kā will destroy his hook, Maui abandons Moana.</p>
<p>Distraught, Moana begs the ocean to take the heart and choose another person to return it to Te Fiti. The spirit of Tala comes to her and encourages to find her true calling within herself. Inspired, Moana retrieves the heart from the ocean and returns to Te Fiti alone. Maui, having had a change of heart, returns to distract the lava demon, and his hook is destroyed in the battle. Realizing that Te Kā is actually Te Fiti without her heart, Moana asks the ocean to clear a path for Te Kā to approach her. She sings a song, asking Te Kā to remember who she truly is, allowing Moana to restore her heart. Te Fiti returns and gives a new canoe to Moana and a new magical hook to Maui before returning to her island form.</p>
<p>In a <a href="/wiki/Post-credits_scene" title="Post-credits scene">post-credits scene</a>, Tamatoa, who has been stranded on his back during Moana and Maui's escape, grumbles to the audience that they would help him if he was a <a href="/wiki/Sebastian_(Disney)" title="Sebastian (Disney)">Jamaican crab named Sebastian</a>.</p>
<h2><span class="mw-headline" id="Cast">Cast</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Moana_(2016_film)&amp;action=edit&amp;section=2" title="Edit section: Cast">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<div class="thumb tright">

So I only want to extract the text between the two h2 tags, that is, the plot. How can I extract it using BeautifulSoup?

Edit 1: This is the code I have so far.

import urllib
from BeautifulSoup import *

movie = raw_input('Enter:')
main = "http://www.wikipedia.org"
url = "http://www.wikipedia.org/wiki/"+movie+"_(disambiguation)"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve a list of the anchor tags
# Each tag is like a dictionary of HTML attributes
tags = soup('a')
for tag in tags:
    chk = tag.get('href', None)
    chk = str(chk)
    if "film" in chk:
        final = chk

html = urllib.urlopen(main+final).read()
soup = BeautifulSoup(html)
new = []
spa = soup.findAll("span",id = "Plot")
spa_1 = soup.findAllNext("p")
for i in spa_1:
    print i

I tried to find the element with id=Plot and then print all the p tags that follow it.

The document is structured like this:

[h2] / [span id=Plot]
...
[h2]

What we can do is search for the span with the id "Plot", then walk through the siblings of its parent (the h2) and collect their text until we reach the next h2 header.

# collect plot in this list
plot = []

# find the node with id of "Plot"
mark = soup.find(id="Plot")

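# walk through the siblings of the parent (h2) node
# until we reach the next h2 node
for elt in mark.parent.nextSiblingGenerator():
    # bare text nodes may not expose .name, so look it up defensively
    if getattr(elt, "name", None) == "h2":
        break
    # only tags carry .text; skip anything that does not
    if hasattr(elt, "text"):
        plot.append(elt.text)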

# enjoy
print("".join(plot))
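
For readers on Python 3 with the bs4 package (BeautifulSoup 4), the same idea might be sketched as follows. In bs4 the `nextSiblingGenerator()` method is deprecated in favour of the `next_siblings` attribute. The `HTML` string is a trimmed-down stand-in for the markup in the question, and `section_text` is a hypothetical helper name, not part of BeautifulSoup. The sketch assumes the page is laid out exactly as in the question's snippet; live pages may be structured differently.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the Wikipedia markup shown in the question.
HTML = """
<h2><span class="mw-headline" id="Plot">Plot</span></h2>
<p>First <a href="/wiki/X">plot</a> paragraph.</p>
<p>Second plot paragraph.</p>
<h2><span class="mw-headline" id="Cast">Cast</span></h2>
<p>Cast paragraph.</p>
"""


def section_text(soup, section_id):
    """Hypothetical helper: join the text of the <p> tags that sit between
    the <h2> containing id=section_id and the next <h2>."""
    mark = soup.find(id=section_id)
    if mark is None:
        return ""
    heading = mark.find_parent("h2")
    if heading is None:
        return ""
    paragraphs = []
    for elt in heading.next_siblings:
        # Text nodes between tags have no tag name, so fall back to None.
        name = getattr(elt, "name", None)
        if name == "h2":
            break  # reached the next section
        if name == "p":
            paragraphs.append(elt.get_text())
    return "\n".join(paragraphs)


soup = BeautifulSoup(HTML, "html.parser")
plot = section_text(soup, "Plot")
print(plot)
```

Collecting only `p` tags, rather than every sibling that has text, keeps thumbnails and other non-paragraph elements out of the result; drop that check if every sibling's text is wanted.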