How to get text which has no HTML tag

Question

Here is the HTML:

<div class="ajaxcourseindentfix">
    <h3>CPSC 353 - Introduction to Computer Security (3) </h3>
    <hr>Security goals, security systems, access controls, networks and security, integrity, cryptography fundamentals, authentication. Attacks: software, network, website; management considerations, security standards in government and industry; security issues in requirements, architecture, design, implementation, testing, operation, maintenance, acquisition, and services.
    <br>
    <br>Prerequisite: <a href="preview_course_nopop.php?catoid=16&amp;coid=96570" onclick="acalogPopup()">CPSC 253U</a>
    <span style="display: none !important">&nbsp;</span>&nbsp;or <a href="#" onclick="acalogPopup()" target="_blank">CPSC 254</a>
    <span style="display: none !important">&nbsp;</span>&nbsp;and <a href="#" onclick="acalogPopup()" target="_blank">CPSC 351</a>
    <span style="display: none !important">&nbsp;</span>
    , declared major/minor in CPSC, CPEN, or CPEI
    <br>
</div>

I need to get the following text from this HTML:

From line 6: or
From line 7: and
, declared major/minor in CPSC, CPEN, or CPEI

I can get the hrefs (course numbers: CPSC 254, etc.) with the following XPath:

# This XPath gives me all the elements that follow the h3; I then iterate through them in my script.
//div[@class='ajaxcourseindentfix']/h3/following-sibling::text()[2]/following-sibling::*

Update

And then I get the text with the following XPath:

# This XPath gives me all the text after the h3 tag.
//div[@class='ajaxcourseindentfix']/h3/following-sibling::text()[2]/following-sibling::text()

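I need to get the course name/prerequisite text in the same form as it appears at URL 1.

With this approach, I first get all the hrefs and then all the text. Is there a better way to achieve this? I don't want to iterate over two XPaths to get the hrefs first, then the text, and then merge them to form the prerequisite string.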

1 http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=99648&show

Answer 1

Try the following code to get the required output:

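from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # html: assumed to hold the markup from the question
div = soup.select("div.ajaxcourseindentfix")[0]
" ".join(div.stripped_strings).split("Prerequisite: ")[-1]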

Output:

'CPSC 253U or CPSC 254 and CPSC 351 , declared major/minor in CPSC, CPEN, or CPEI'
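For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The helper name `prerequisite_text`, the `marker` parameter, and the shortened description in the sample markup are my own assumptions for illustration, not part of the original page or of BeautifulSoup. It also tidies the stray space that joining the strings leaves before the comma.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (description text shortened).
HTML = """
<div class="ajaxcourseindentfix">
    <h3>CPSC 353 - Introduction to Computer Security (3) </h3>
    <hr>Security goals, security systems, access controls.
    <br>
    <br>Prerequisite: <a href="preview_course_nopop.php?catoid=16&amp;coid=96570" onclick="acalogPopup()">CPSC 253U</a>
    <span style="display: none !important">&nbsp;</span>&nbsp;or <a href="#" onclick="acalogPopup()" target="_blank">CPSC 254</a>
    <span style="display: none !important">&nbsp;</span>&nbsp;and <a href="#" onclick="acalogPopup()" target="_blank">CPSC 351</a>
    <span style="display: none !important">&nbsp;</span>
    , declared major/minor in CPSC, CPEN, or CPEI
    <br>
</div>
"""


def prerequisite_text(html, marker="Prerequisite:"):
    """Hypothetical helper: return the text after `marker` in the course div, or None."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.select_one("div.ajaxcourseindentfix")
    if div is None:
        return None
    # stripped_strings yields every text node (tagged or not), trimmed,
    # and skips nodes that are empty after trimming (such as the &nbsp; spans).
    text = " ".join(div.stripped_strings)
    if marker not in text:
        return None
    tail = text.split(marker, 1)[1].strip()
    # Remove the space that the join leaves before punctuation.
    return re.sub(r"\s+([,.;])", r"\1", tail)


print(prerequisite_text(HTML))
```

With the sample markup above this should print `CPSC 253U or CPSC 254 and CPSC 351, declared major/minor in CPSC, CPEN, or CPEI`. Because link text and bare text nodes come out in document order, there is no need for two separate XPath passes and a merge step.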
