How do I get the content between a tag and its closing tag in HTML using Python's Beautiful Soup?

Question

I have a line of HTML like this:

<span class="cd__headline-text">Is this model too thin for Yves Saint Laurent? </span>

I want to extract the headline, "Is this model too thin for Yves Saint Laurent?", from that line. How do I get whatever is between <tag> and </tag>?

I'm not very familiar with regular expressions.

Answer 1

You should use an HTML parser such as BeautifulSoup rather than a regular expression. For more complex use cases you can also use the etree library together with XPath.
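As a rough illustration of the etree-with-XPath route, here is a minimal sketch using the standard library's xml.etree.ElementTree, which supports only a limited XPath subset. It assumes the markup is well-formed XML; the wrapping <div> is added here only so the example has a root element to search under. Real-world HTML is often not well-formed, in which case a dedicated HTML parser is the safer choice.

```python
import xml.etree.ElementTree as ET

# Assumption: the markup is well-formed XML. The outer <div> is hypothetical,
# added so that the <span> is a descendant of the root element.
markup = (
    '<div><span class="cd__headline-text">'
    'Is this model too thin for Yves Saint Laurent? </span></div>'
)

root = ET.fromstring(markup)

# ElementTree's XPath subset supports attribute predicates like [@class='...'].
span = root.find(".//span[@class='cd__headline-text']")

# .text holds the text directly inside the element; strip() drops the trailing space.
headline = span.text.strip()
print(headline)
```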

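That said, if you do want to use a regular expression:

A regular expression is a domain-specific language that makes string parsing and manipulation much easier. Some people may disagree that regex offers an elegant solution to this kind of problem, but looping over the string by hand could take forever.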

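import re

html_string = '<span class="cd__headline-text">Is this model too thin for Yves Saint Laurent? </span>'
# (?<=>) requires a preceding '>', (?=<) requires a following '<'.
regex = re.compile(r'(?<=>).*(?=<)')
# findall returns a list; the first match keeps the trailing space.
result = regex.findall(html_string)[0]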

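This regex uses lookahead and lookbehind assertions. Learning regular expressions takes a fair amount of time; I'd suggest reading a good tutorial or book on the subject.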
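One caveat with the pattern above: `.*` is greedy, so it only behaves as intended when the string holds a single element. The sketch below compares it with a variant that excludes angle brackets from the match. The two-span input string and the variable names are made up for illustration, and neither pattern handles nested tags.

```python
import re

# Hypothetical input with two sibling elements on one line.
html_string = '<span class="a">first</span><span class="b">second</span>'

# Greedy: .* runs from the first '>' to the last '<', swallowing the tags in between.
greedy = re.findall(r'(?<=>).*(?=<)', html_string)

# [^<>]+ cannot cross a tag boundary, so each element's text is matched separately.
per_element = re.findall(r'(?<=>)[^<>]+(?=<)', html_string)

print(greedy)       # ['first</span><span class="b">second']
print(per_element)  # ['first', 'second']
```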

Answer 2

If your element contains only text, use the .string attribute:

headline = soup.find(class_='cd__headline-text')
print(headline.string)

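If it contains other tags, you can either get all the text contained in the current element and its descendants, or get only specific strings from the current element.

The element.get_text() function recurses and collects all strings in the element and its child elements, joins them with a string of your choice (an empty string by default), and optionally strips whitespace.

To get only specific strings, you can iterate over the .strings or .stripped_strings generators, or use the element's contents to access everything it contains and then pick out the instances of the NavigableString type.

Demonstrated with your example: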

>>> from bs4 import BeautifulSoup
>>> markup = '<span class="cd__headline-text">Is this model too thin for Yves Saint Laurent? </span>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> headline = soup.find(class_='cd__headline-text')
>>> print(headline.string)
Is this model too thin for Yves Saint Laurent? 
>>> print(list(headline.strings))
['Is this model too thin for Yves Saint Laurent? ']
>>> print(list(headline.stripped_strings))
['Is this model too thin for Yves Saint Laurent?']
>>> print(headline.get_text())
Is this model too thin for Yves Saint Laurent? 
>>> print(headline.get_text(strip=True))
Is this model too thin for Yves Saint Laurent?

And with an additional element added:

>>> markup = '<span class="cd__headline-text">Is this model <em>too thin</em> for Yves Saint Laurent? </span>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> headline = soup.find(class_='cd__headline-text')
>>> headline.string is None
True
>>> print(list(headline.strings))
['Is this model ', 'too thin', ' for Yves Saint Laurent? ']
>>> print(list(headline.stripped_strings))
['Is this model', 'too thin', 'for Yves Saint Laurent?']
>>> print(headline.get_text())
Is this model too thin for Yves Saint Laurent? 
>>> print(headline.get_text(' - ', strip=True))
Is this model - too thin - for Yves Saint Laurent?
>>> headline.contents
['Is this model ', <em>too thin</em>, ' for Yves Saint Laurent? ']
>>> from bs4 import NavigableString
>>> [el for el in headline.children if isinstance(el, NavigableString)]
['Is this model ', ' for Yves Saint Laurent? ']

python

beautifulsoup

web-scraping