Validating whether a string is valid HTML in Python?

Question

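What is the best technique for finding out whether a string contains valid, syntactically correct HTML?

I tried HTMLParser from the html.parser module, reasoning that if it produced no errors during parsing, the string must be valid HTML. That did not help, because it parses even invalid strings without raising any error.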

from html.parser import HTMLParser

parser = HTMLParser()

parser.feed('<h1> hi')
parser.close()

I expected it to raise an exception or report an error because the closing tag is missing, but it did not.

Answer 1

    from bs4 import BeautifulSoup

    st = """<html>
    <head><title>I'm title</title></head>
    </html>"""
    st1 = "who are you"

    # find() returns the first tag in the parsed string, or None if there is none
    print(bool(BeautifulSoup(st, "html.parser").find()))   # True
    print(bool(BeautifulSoup(st1, "html.parser").find()))  # False

Answer 2

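The traditional HTML parser from html.parser does not check HTML markup for errors; it only "tokenizes" each piece of content in the string.

You may want to take a look at py_w3c. It looks like nobody is paying attention to this module, but it does identify errors effectively: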
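To make the "tokenize only" behaviour concrete, here is a minimal sketch that subclasses HTMLParser and keeps its own stack of open tags. The names TagBalanceChecker, find_tag_problems and VOID_ELEMENTS are hypothetical and are not part of html.parser. Note the assumption: this only checks that start and end tags are balanced, which is stricter than real HTML (where tags such as </p> or </li> may be omitted) and is not a substitute for a full validator.

```python
from html.parser import HTMLParser

# Void elements never take an end tag, so they are not tracked on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records unbalanced tags while HTMLParser tokenizes."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # start tags still waiting for their end tag
        self.problems = []   # descriptions of mismatched end tags

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return  # e.g. the end-tag event generated for <br/>
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(f"unexpected end tag </{tag}>")


def find_tag_problems(text):
    """Return a list of balance problems; an empty list means tags are balanced."""
    checker = TagBalanceChecker()
    checker.feed(text)
    checker.close()
    unclosed = [f"unclosed element <{tag}>" for tag in checker.open_tags]
    return checker.problems + unclosed


print(find_tag_problems("<h1> hi"))       # ['unclosed element <h1>']
print(find_tag_problems("<h1> hi</h1>"))  # []
```

The parser itself still raises nothing for '<h1> hi'; the error only appears because the subclass compares the start and end tag events it receives.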

from py_w3c.validators.html.validator import HTMLValidator


val = HTMLValidator()
val.validate_fragment("<h1> hey yo")

for error in val.errors:
    print(error.get("message"))

$ python3.7 html-parser.py
Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
Element “head” is missing a required instance of child element “title”.
End of file seen and there were open elements.
Unclosed element “h1”.

python

html-parsing