Extracting text between link tags using BeautifulSoup in Python
I have the following HTML:
<a href="/Content.aspx?id=102966" id="mylink" target="_blank">EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/> Nationwide - <span title="college students/staff of schools in valid states">Restrictions</span> - Ends 6/30/15</a>
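I am trying to extract the text that is displayed when this HTML is rendered.
More specifically, for this example 'a' tag, I am trying to extract "EZSTORAGE - PACK IT. STORE IT. WIN - Nationwide - Restrictions - Ends 6/30/15"
However, I cannot get the full text because it is broken up by the 'img' and 'span' tags.
For more context, I have been using the code below to search for all matching 'a' tags and extract the link text.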
for link in soup.find_all('a', id='mylink'):
    raw.append(link)
    link_text = link.contents[0].encode('utf-8')
    sweeps.append(link_text)
# output: 'EZSTORAGE - PACK IT. STORE IT. WIN - '
Any insight would be much appreciated!
Can't you just use link.text instead of link.contents, as in this MWE?
text = """
<a href="/Content.aspx?id=102966" id="mylink" target="_blank">EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/> Nationwide - <span title="college students/staff of schools in valid states">Restrictions</span> - Ends 6/30/15</a>
"""
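from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(text, 'html.parser')
for link in soup.find_all('a', id='mylink'):
    # .text joins every text node inside the tag, including nested ones
    link_text = link.text
    print(link_text)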
Result:
EZSTORAGE - PACK IT. STORE IT. WIN - Nationwide - Restrictions - Ends 6/30/15
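To see why link.contents[0] returned only the first fragment, it may help to look at what .contents holds. The sketch below assumes Python 3 and the built-in 'html.parser'; the variable names (html, first_chunk, raw_text, clean_text) are illustrative only. It also collapses whitespace, since removing the 'img' tag leaves two adjacent spaces in the raw text.

```python
from bs4 import BeautifulSoup

html = (
    '<a href="/Content.aspx?id=102966" id="mylink" target="_blank">'
    'EZSTORAGE - PACK IT. STORE IT. WIN - '
    '<img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/>'
    ' Nationwide - '
    '<span title="college students/staff of schools in valid states">Restrictions</span>'
    ' - Ends 6/30/15</a>'
)

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", id="mylink")

# .contents lists the direct children: text, <img>, text, <span>, text.
# Index 0 is therefore only the first text fragment.
first_chunk = link.contents[0]

# get_text() concatenates every text node under the tag, nested ones included.
raw_text = link.get_text()

# Collapse runs of whitespace, roughly as a browser does when rendering.
clean_text = " ".join(raw_text.split())
print(clean_text)
```

If the fragments are wanted separately rather than joined, iterating over link.stripped_strings is another option.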
You can use a regular expression to find all the text:
import re
content = r'<a href="/Content.aspx?id=102966" id="mylink" target="_blank">EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/> Nationwide - <span title="college students/staff of schools in valid states">Restrictions</span> - Ends 6/30/15</a>'
# Capture every run of text that sits between a '>' and the next '<'
links = re.findall(r'>(.*?)<', content)
# Join the captured pieces into a single string
a = "".join(links)
print(a)
# Prints (with a double space where the img tag was):
# EZSTORAGE - PACK IT. STORE IT. WIN -  Nationwide - Restrictions - Ends 6/30/15
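On a full page, the pattern above would also pick up text from every other element, so the regex approach probably needs to be scoped to the one link first. The following is only a rough sketch using the standard re module: it assumes the attribute is written exactly as id="mylink", that the link contains no nested 'a' tags, and the page string (including the extra 'p' element) is made up for illustration. Regular expressions are fragile on real-world HTML, so a parser is usually the safer choice.

```python
import re

# Hypothetical page: the link from the question plus some unrelated markup.
page = (
    '<p>Unrelated paragraph</p>'
    '<a href="/Content.aspx?id=102966" id="mylink" target="_blank">'
    'EZSTORAGE - PACK IT. STORE IT. WIN - '
    '<img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/>'
    ' Nationwide - '
    '<span title="college students/staff of schools in valid states">Restrictions</span>'
    ' - Ends 6/30/15</a>'
)

# Step 1: grab only the inner HTML of the <a id="mylink"> element.
match = re.search(r'<a\b[^>]*\bid="mylink"[^>]*>(.*?)</a>', page, re.DOTALL)
inner = match.group(1) if match else ""

# Step 2: drop the remaining tags, then collapse runs of whitespace.
text = " ".join(re.sub(r"<[^>]+>", "", inner).split())
print(text)
```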