How can I use regex in Python to read between HTML tags from the bottom of the file?

Question

I have an HTML response, and I need to get the data between the last <title> tags on the page. Is there a way to do this with regex in Python, or with some other tool in Python?

For example:

<title>abc
</title>

<title>def
</title>

should return def.

Answer 1

Use <title>\s*([\s\S]+?)\s*</title> as your regex (it strips leading and trailing whitespace from the title), call findall, and take the last occurrence:


import re

text = """abc
<title>abc
</title>
def
ghi
<title>def
</title>
jkl
"""

tags = re.findall(r'<title>\s*([\s\S]+?)\s*</title>', text)
print(tags[-1]) # the last one

Prints:

def
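
One thing to watch with the snippet above: tags[-1] raises IndexError when the page has no <title> at all. A minimal sketch of a wrapper that returns None in that case is shown here. The name last_title and the re.IGNORECASE flag are my own additions, not part of the original answer; the flag assumes you also want to match tags written as <TITLE>.

```python
import re

# Same pattern as in the answer, compiled once.
# re.IGNORECASE is an assumption added here so that <TITLE> also matches.
TITLE_RE = re.compile(r'<title>\s*([\s\S]+?)\s*</title>', re.IGNORECASE)


def last_title(html):
    """Return the text of the last <title> element, or None if there is none.

    Hypothetical helper, not part of the original answer.
    """
    matches = TITLE_RE.findall(html)
    return matches[-1] if matches else None


print(last_title("<title>abc\n</title>\n<title>def\n</title>"))
print(last_title("<p>no title here</p>"))
```

The first call should give def and the second None. Note that this pattern still will not match a title tag that carries attributes, such as <title lang="en">.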

Answer 2

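You should not use regex to parse HTML, because in most cases it is inefficient and hard to read. Regex should be a last resort, for when you have no other option. See here for more information.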

Thankfully, Python has plenty of HTML parsers, such as BeautifulSoup.

With BeautifulSoup you can get the last title tag (here soup is a BeautifulSoup object already built from the response):

last_title = soup.find_all('title')[-1].text.replace('\n', '')
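
If installing a third-party package is not an option, the standard library's html.parser module can do a similar job. The sketch below is not the BeautifulSoup approach from this answer but a stdlib stand-in for it; the class name TitleCollector and the sample html string are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text of every <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.titles = []      # stripped text of each completed <title>
        self._buffer = None   # list of text chunks while inside <title>, else None

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title":
            self._buffer = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None


# Sample input, assumed for illustration (same shape as the question's example).
html = "abc\n<title>abc\n</title>\ndef\n<title>def\n</title>\njkl\n"

collector = TitleCollector()
collector.feed(html)
collector.close()

last = collector.titles[-1] if collector.titles else None
print(last)
```

Unlike the regex, this also handles title tags with attributes, and it returns None instead of raising when the page has no title.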


html

python

regex

parsing

response