How to extract and ignore span in markup? - python
How can I extract the text inside a <span> in HTML markup and leave it out of the remaining text?
My input looks like this:
<ul class="definitions">
<li><span>noun</span> the joining together of businesses which deal with different stages in the production or <a href="sale.html">sale</a> of the same <u slug="product">product</u>, as when a restaurant <a href="chain.html">chain</a> takes over a <a href="wine.html">wine</a> importer</li></ul>
Desired output:
label = 'noun' # String embedded between <span>...</span>
meaning = 'the joining together of businesses which deal with different stages in the production or sale of the same product, as when a restaurant chain takes over a wine importer' # the text without the string embedded within <span>...</span>
related_to = ['sale', 'chain', 'wine'] # String embedded between <a>...</a>
utag = ['product'] # String embedded between <u>...</u>
I have tried this:
>>> from bs4 import BeautifulSoup
>>> text = '''<ul class="definitions">
... <li><span>noun</span> the joining together of businesses which deal with different stages in the production or <a href="sale.html">sale</a> of the same <u slug="product">product</u>, as when a restaurant <a href="chain.html">chain</a> takes over a <a href="wine.html">wine</a> importer</li></ul>'''
>>> bsoup = BeautifulSoup(text, 'html.parser')
>>> bsoup.text
u'\nnoun the joining together of businesses which deal with different stages in the production or sale of the same product, as when a restaurant chain takes over a wine importer'
# Getting the `label`
>>> label = bsoup.find('span')
>>> label
<span>noun</span>
>>> label = bsoup.find('span').text
>>> label
u'noun'
# Getting the text.
>>> bsoup.text.strip()
u'noun the joining together of businesses which deal with different stages in the production or sale of the same product, as when a restaurant chain takes over a wine importer'
>>> definition = bsoup.text.strip()
>>> definition = definition.partition(' ')[2] if definition.split()[0] == label else definition
>>> definition
u'the joining together of businesses which deal with different stages in the production or sale of the same product, as when a restaurant chain takes over a wine importer'
# Getting related_to and utag
>>> related_to = [r.text for r in bsoup.find_all('a')]
>>> related_to
[u'sale', u'chain', u'wine']
>>> utag = [r.text for r in bsoup.find_all('u')]
>>> utag
[u'product']
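Using BeautifulSoup works, but it is rather verbose for getting what I need.
Is there another way to achieve the same output?
Is there a regex approach that uses groups to capture the desired output?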
The markup is still well structured, and you have clearly stated the set of rules. I would still use BeautifulSoup,
applying the "Extract Method" refactoring to tidy it up:
from pprint import pprint
from bs4 import BeautifulSoup
data = """
<ul class="definitions">
<li><span>noun</span> the joining together of businesses which deal with different stages in the production or <a href="sale.html">sale</a> of the same <u slug="product">product</u>, as when a restaurant <a href="chain.html">chain</a> takes over a <a href="wine.html">wine</a> importer</li></ul>
"""
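def get_info(elm):
    # The <span> holds the label; everything after it in the <li> is the meaning.
    label = elm.find("span")
    return {
        "label": label.text,
        # Siblings are either tags (which have .text) or plain strings.
        "meaning": "".join(getattr(sibling, "text", sibling) for sibling in label.next_siblings).strip(),
        "related_to": [a.text for a in elm.find_all("a")],
        "utag": [u.text for u in elm.find_all("u")]
    }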
soup = BeautifulSoup(data, "html.parser")
pprint(get_info(soup.li))
Prints:
{'label': u'noun',
 'meaning': u'the joining together of businesses which deal with different stages in the production or sale of the same product, as when a restaurant chain takes over a wine importer',
 'related_to': [u'sale', u'chain', u'wine'],
 'utag': [u'product']}
PyQuery is another alternative to BeautifulSoup. It follows a jQuery-like syntax for extracting information from HTML.
Also, for regex, something like the following could be used.
import re
text = """<ul class="definitions"><li><span>noun</span> the joining together of businesses which deal with different stages in the production or <a href="sale.html">sale</a> of the same <u slug="product">product</u>, as when a restaurant <a href="chain.html">chain</a> takes over a <a href="wine.html">wine</a> importer</li></ul>"""
match_pattern = re.compile(r"""
    (?P<label>(?<=<span>)\w+?(?=</span>))  # named group 'label', exposed via groupdict()
    """, re.VERBOSE)
match = match_pattern.search(text)
print(match.groupdict())
Output:
{'label': 'noun'}
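Using the above as a template, you can build on it for the other HTML tags as well. It uses (?P<name>...)
to name the matched pattern (here, label), together with a (?<=...) positive lookbehind assertion
and a (?=...) lookahead assertion, so that the surrounding tags must be present but are not part of the match.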
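One way the regex template might be extended to the remaining fields is sketched below. It is only an illustration, not part of the original answer: the helper names `tag_texts` and `extract` are hypothetical, and it assumes flat markup like the sample, where the tags of interest are never nested inside themselves. Regular expressions are fragile on arbitrary HTML, so a parser such as BeautifulSoup is likely the safer choice for anything less regular.

```python
import re

# Same sample markup as in the question.
text = (
    '<ul class="definitions">\n'
    '<li><span>noun</span> the joining together of businesses which deal '
    'with different stages in the production or <a href="sale.html">sale</a> '
    'of the same <u slug="product">product</u>, as when a restaurant '
    '<a href="chain.html">chain</a> takes over a <a href="wine.html">wine</a> '
    'importer</li></ul>'
)


def tag_texts(tag, html):
    """Hypothetical helper: text inside every <tag ...>...</tag> pair (no nesting)."""
    # \b keeps "u" from matching "<ul ...>"; [^>]* skips any attributes.
    pattern = re.compile(r"<{0}\b[^>]*>(.*?)</{0}>".format(tag), re.DOTALL)
    return pattern.findall(html)


def extract(html):
    """Hypothetical helper: build the four fields asked for in the question."""
    label = tag_texts("span", html)[0]
    # Drop the whole <span>...</span> element, then strip every remaining tag.
    without_span = re.sub(r"<span\b[^>]*>.*?</span>", "", html, flags=re.DOTALL)
    meaning = re.sub(r"<[^>]+>", "", without_span).strip()
    return {
        "label": label,
        "meaning": meaning,
        "related_to": tag_texts("a", html),
        "utag": tag_texts("u", html),
    }


info = extract(text)
print(info)
```

Removing the `<span>` element before stripping the other tags is what keeps the label out of `meaning`, so no string comparison against the label is needed afterwards.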