Extracting text from span class tag with beautifulsoup
I am trying to extract some text elements from the span tags of a given class on a website.
Here is a snippet of the HTML code:
<h1>2 Some address</h1>
</div>
<div id="smi-summary-items">
<div id="smi-price-string">€230,000</div>
<span class="header_text"> Detached House</span><span class="bar"> | </span><span class="header_text">3 Beds</span><span class="bar"> | </span><span class="header_text">2 Baths</span>
<!-- Text_Link_Full_Ad_Unit -->
<div id='dfp-text_link_full_ad_unit' class='sale'>
<script type='text/javascript'>
googletag.cmd.push(function()
{
googletag.display('dfp-text_link_full_ad_unit');
}
);
</script>
</div>
I want to extract the text "3 Beds" and "2 Baths".
I have tried a few solutions, but I mostly get errors or empty results.
Can anyone suggest a solution?
As far as I understand, you can simply filter the desired elements by class:
[item.get_text() for item in soup.select("span.header_text")]
Complete working example:
from bs4 import BeautifulSoup
data = """
<div id="smi-summary-items">
<div id="smi-price-string">€230,000</div>
<span class="header_text"> Detached House</span><span class="bar"> | </span><span class="header_text">3 Beds</span><span class="bar"> | </span><span class="header_text">2 Baths</span>
<!-- Text_Link_Full_Ad_Unit -->
<div id='dfp-text_link_full_ad_unit' class='sale'>
<script type='text/javascript'>
googletag.cmd.push(function()
{
googletag.display('dfp-text_link_full_ad_unit');
}
);
</script>
</div>"""
soup = BeautifulSoup(data, "html.parser")
print([item.get_text(strip=True) for item in soup.select("span.header_text")])
Prints:
['Detached House', '3 Beds', '2 Baths']
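The selector above returns all three header_text spans, while the question asks only for the bed and bath counts. One possible way to narrow the list is to filter the extracted strings afterwards. This is only a sketch: the variable names and the suffix check are my own assumptions, and it relies on the text always ending in "Bed(s)" or "Bath(s)", which may not hold for every listing.

```python
from bs4 import BeautifulSoup

html = """
<div id="smi-summary-items">
<div id="smi-price-string">€230,000</div>
<span class="header_text"> Detached House</span><span class="bar"> | </span><span class="header_text">3 Beds</span><span class="bar"> | </span><span class="header_text">2 Baths</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the stripped text of every span with class "header_text".
texts = [span.get_text(strip=True) for span in soup.select("span.header_text")]

# Keep only the entries that look like bed or bath counts
# (assumed suffixes; adjust them to the wording the site actually uses).
wanted = [t for t in texts if t.endswith(("Bed", "Beds", "Bath", "Baths"))]
print(wanted)
```

Filtering on the text rather than on position (for example texts[1:]) should keep working if a listing omits the property type or adds another header_text span.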
The following code extracts the text of the span elements with class header_text from the page:
>>> from bs4 import BeautifulSoup
>>> import re
>>> content = """<h1>2 Some address</h1>
... </div>
... <div id="smi-summary-items">
... <div id="smi-price-string">€230,000</div>
... <span class="header_text"> Detached House</span><span class="bar"> | </span><span class="header_text">3 Beds</span><span class="bar"> | </span><span class="header_text">2 Baths</span>
... <!-- Text_Link_Full_Ad_Unit -->
... <div id='dfp-text_link_full_ad_unit' class='sale'>
... <script type='text/javascript'>
... googletag.cmd.push(function()
... {
... googletag.display('dfp-text_link_full_ad_unit');
... }
... );
... </script>
... </div>"""
>>> soup = BeautifulSoup(content, "html.parser")
>>> req = soup.find_all("span", {"class":"header_text"})
>>> print(req)
[<span class="header_text"> Detached House</span>, <span class="header_text">3 Beds</span>, <span class="header_text">2 Baths</span>]
>>> x23 = []
>>> for i in req:
... x23.append(i.get_text())
...
>>> print(x23)
[' Detached House', '3 Beds', '2 Baths']
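Note that the first item still has a leading space, because get_text() was called without strip=True. A slightly more compact variant of the same find_all approach might look like the sketch below. It also limits the search to the smi-summary-items container and guards against that container being missing, which is one common cause of the "errors or empty results" mentioned in the question. The names container and values are illustrative, not from the original answer.

```python
from bs4 import BeautifulSoup

content = """
<div id="smi-summary-items">
<div id="smi-price-string">€230,000</div>
<span class="header_text"> Detached House</span><span class="bar"> | </span><span class="header_text">3 Beds</span><span class="bar"> | </span><span class="header_text">2 Baths</span>
</div>
"""

soup = BeautifulSoup(content, "html.parser")

# Look only inside the summary block; find() returns None if it is absent.
container = soup.find("div", id="smi-summary-items")

if container is None:
    values = []
else:
    # class_ (with a trailing underscore) is the keyword form of {"class": ...}.
    spans = container.find_all("span", class_="header_text")
    values = [span.get_text(strip=True) for span in spans]

print(values)
```

If this prints an empty list on the live page, the markup may be generated by JavaScript after load, in which case the HTML you downloaded would not contain these spans at all.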