How to remove HTML and URLs with Python

Question

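I have a list of XML files, and I need to filter some tags out of them. The problem is the text: it contains a lot of HTML markup and URLs, and I need plain text. I want to loop over the elements, remove this markup, and append the cleaned text to a new list. This is what I have so far.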

    data = []
    for conv in root.findall('./conversations/conversation'):
        pattern = re.compile( r'!\b(((ht|f)tp(s?))\://)?(www.|[a-z].)[a-z0-9\-\.]+\.)(\:[0-9]+)*(/($|[a-z0-9\.\,\;\?\\\\+&amp;%$#\=~_\-]+))*\b!i')
        if pattern.search(conv.text):
           re.sub(pattern, ' ')
           data.append(conv.text)

I can't find a suitable regular expression to remove things like this: br />;<br /> and URLs like this: http://neocash43.blog.com/2011/07/26/psp-sport-assessment-neopets-the-wand-of-wishing/</a>

My second problem is that, with this XML root structure, I don't know how to append the cleaned conversation text to my new list.

Answer 1

The pattern.web Python module has an HTML-to-text function called plaintext. By default, this function removes all HTML tags. For the URLs, use an existing regex.
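If pattern.web is not installed, the same idea (strip the tags first, then remove URLs with a regex, then collect the results) can be sketched with the standard library alone. This is only a rough stand-in for plaintext, not its implementation: TAG_RE, URL_RE, clean_text and the sample XML are assumptions made up for illustration, and the tag regex will not cope with every kind of malformed HTML.

```python
import html
import re
import xml.etree.ElementTree as ET

# Hypothetical patterns for illustration; neither is exhaustive.
TAG_RE = re.compile(r"<[^>]+>")  # anything that looks like a tag
URL_RE = re.compile(r"(?:https?|ftp)://\S+|www\.\S+", re.IGNORECASE)

def clean_text(raw):
    """Return raw with entities decoded and tags and URLs removed."""
    text = html.unescape(raw or "")  # decode any entities left in the text
    text = TAG_RE.sub(" ", text)     # drop tags first, so "</a>" does not stick to a URL
    text = URL_RE.sub(" ", text)     # then drop bare URLs
    return " ".join(text.split())    # collapse the leftover whitespace

# Sample document; the structure is assumed from the XPath in the question.
xml_doc = """<root><conversations>
<conversation>Hello&lt;br /&gt;world http://example.com/a/b&lt;/a&gt; bye</conversation>
<conversation>plain text</conversation>
</conversations></root>"""

root = ET.fromstring(xml_doc)

# Clean every conversation and collect the results in a new list.
data = [clean_text(conv.text) for conv in root.findall("./conversations/conversation")]
print(data)
```

Note that re.sub (or pattern.sub) returns a new string rather than modifying its input, so the result has to be kept and appended; appending conv.text would store the uncleaned text.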

Answer 2

You can try htmlStripper.py from the pyparsing library: http://pyparsing.wikispaces.com/file/view/htmlStripper.py/591745692/htmlStripper.py. I just ran this script on my machine with Python 3.4.
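If that script cannot be fetched, a parser-based stripper can also be sketched with the standard library's html.parser. This is not the pyparsing script; the names TextExtractor, BLOCK_TAGS and strip_html are hypothetical, and the choice of which tags count as line breaks is an assumption.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes and discards tags (hypothetical helper)."""

    # Tags treated as word separators; an assumed, incomplete set.
    BLOCK_TAGS = {"br", "p", "div"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and similar entities
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <br />.
        if tag in self.BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(markup):
    """Return the text content of markup with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())

print(strip_html("first line<br />second &amp; third <b>bold</b>"))
```

Compared with a tag regex, a real parser is more tolerant of attributes that contain ">" and of character references, at the cost of a little more code.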

Tags: html, python, regex, xml, text-classification