BeautifulSoup: when pulling text from a section, <emph> and other tags are ignored, causing adjacent words to be pushed together

Question

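I have an XML file, and I want to pull out all the text between all of the <p>..</p> tags. A sample of the text is below. The problem is that in a sentence like: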

"Because the <emph>raspberry</emph> and.."

the output is "Because theraspberryand...". Somehow the emph tags are being dropped (which is good), but they are dropped in a way that pushes the adjacent words together.

Here is the relevant code I am using:

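from BeautifulSoup import BeautifulSoup  # bs3 import (assumed; not shown in the original snippet)

text = ""  # accumulator (assumed to be initialized like this earlier)
xml = BeautifulSoup(xml, convertEntities=BeautifulSoup.HTML_ENTITIES)
for para in xml.findAll('p'):
    text = text + " " + para.text + " "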

Here is the beginning of the document, in case the full text helps:

<!DOCTYPE art SYSTEM "keton.dtd">
<art jid="PNAS" aid="1436" vid="94" iss="14" date="07-08-1997" ppf="7349" ppl="7355">
<fm>
<doctopic>Developmental Biology</doctopic>
<dochead>Inaugural Article</dochead>
<docsubj>Biological Sciences</docsubj>
<atl>Suspensor-derived polyembryony caused by altered expression of
valyl-tRNA synthetase in the <emph>twn2</emph>
mutant of <emph>Arabidopsis</emph></atl>
<prs>This contribution is part of the special series of Inaugural
Articles by members of the National Academy of Sciences elected on
April 30, 1996.</prs>
<aug>
<au><fnm>James Z.</fnm><snm>Zhang</snm></au>
<au><fnm>Chris R.</fnm><snm>Somerville</snm></au>
<fnr rid="FN150"><aff>Department of Plant Biology, Carnegie Institution of Washington,
290 Panama Street, Stanford CA 94305</aff>
</fnr></aug>
<acc>May 9, 1997</acc>
<con>Chris R. Somerville</con>
<pubfront>
<cpyrt><date><year>1997</year></date>
<cpyrtnme><collab>The National Academy of Sciences of the USA</collab></cpyrtnme></cpyrt>
<issn>0027-8424</issn><extent>7</extent><price>2.00/0</price>
</pubfront>
<fn id="FN150"><p>To whom reprint requests should be addressed. e-mail:
<email>crs@andrew.stanford.edu</email>.</p>
</fn>
<abs><p>The <emph>twn2</emph> mutant of <emph>Arabidopsis</emph>
exhibits a defect in early embryogenesis where, following one or two
divisions of the zygote, the decendents of the apical cell arrest. The
basal cells that normally give rise to the suspensor proliferate
abnormally, giving rise to multiple embryos. A high proportion of the
seeds fail to develop viable embryos, and those that do, contain a high
proportion of partially or completely duplicated embryos. The adult
plants are smaller and less vigorous than the wild type and have a
severely stunted root. The <emph>twn2-1</emph> mutation, which is the
only known allele, was caused by a T-DNA insertion in the 5′
untranslated region of a putative valyl-tRNA synthetase gene,
<it>valRS</it>. The insertion causes reduced transcription of the
<it>valRS</it> gene in reproductive tissues and developing seeds but
increased expression in leaves. Analysis of transcript initiation sites
and the expression of promoter–reporter fusions in transgenic plants
indicated that enhancer elements inside the first two introns interact
with the border of the T-DNA to cause the altered pattern of expression
of the <it>valRS</it> gene in the <emph>twn2</emph> mutant. The
phenotypic consequences of this unique mutation are interpreted in the
context of a model, suggested by Vernon and Meinke &amp;lsqbVernon, D. M. &amp;
Meinke, D. W. (1994) <emph>Dev. Biol.</emph> 165, 566–573&amp;rsqb, in
which the apical cell and its decendents normally suppress the
embryogenic potential of the basal cell and its decendents during early
embryo development.</p>
</abs>
</fm>

Answer 1

I think the problem here is that you're trying to write bs4 code with bs3.

The obvious fix is to use bs4 instead.
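As a rough illustration of the bs4 route, here is a minimal sketch. It assumes the `bs4` package is installed and uses the built-in "html.parser" backend; the one-paragraph `sample` string is made up for the example and stands in for the real file. `get_text(" ", strip=True)` strips each text node and joins the pieces with a single space, while a plain `get_text()` keeps whatever whitespace the document already had.

```python
from bs4 import BeautifulSoup

# Hypothetical one-paragraph sample standing in for the real XML file.
sample = (
    "<abs><p>Because the <emph>raspberry</emph> and the "
    "<it>valRS</it> gene.</p></abs>"
)

soup = BeautifulSoup(sample, "html.parser")

# Strip every text node and join the pieces with exactly one space.
paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]

# Plain get_text() concatenates the nodes and keeps their original whitespace.
raw = soup.p.get_text()

text = " ".join(paragraphs)
print(text)
```

Either form keeps "the", "raspberry" and "and" apart in this sample; the separator version is the safer choice if some inline tags in the real file have no whitespace around them.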

But if you stay on bs3, the documentation shows two ways to recursively get all the text from everything in a soup:

''.join(e for e in soup.recursiveChildGenerator() if isinstance(e, unicode))
''.join(soup.findAll(text=True))

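Obviously, you can change either of these to explicitly strip the whitespace at the edges and add exactly one space between each node, rather than relying on whatever space may already be there: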

' '.join(e.strip() for e in soup.recursiveChildGenerator() if isinstance(e, unicode))
# unicode.strip rather than str.strip: bs3 text nodes are unicode subclasses
' '.join(map(unicode.strip, soup.findAll(text=True)))

I wouldn't want to guarantee that this is exactly the same as the bs4 text attribute... but I think it's what you want.
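To make the effect of the separator concrete, here is a small sketch that needs no parser at all. The `text_nodes` list is a hand-written stand-in for the text nodes of the sentence in the question (an assumption for illustration; a real run would get them from the soup). It is written to run on a current Python, so it does not use the `unicode` type from the snippets above. Whether bs3's text attribute internally strips and joins exactly like the `jammed` line is also an assumption; the sketch only shows that doing so reproduces the output from the question.

```python
# Hand-written stand-in for the text nodes of:
#   Because the <emph>raspberry</emph> and..
text_nodes = ["Because the ", "raspberry", " and.."]

# Joining the nodes untouched keeps the document's own whitespace.
as_is = "".join(text_nodes)

# Stripping each node and joining with "" pushes the words together.
jammed = "".join(node.strip() for node in text_nodes)

# Stripping each node and joining with " " puts one space between nodes.
spaced = " ".join(node.strip() for node in text_nodes)

print(as_is)
print(jammed)
print(spaced)
```

One side effect of the space-joined form is that punctuation sitting directly after a closing tag, as in `<email>...</email>.` in the sample file, ends up with a space in front of it.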

Tags: python, xml, parsing, beautifulsoup, python-2.7