How to extract the content using BeautifulSoup
I am trying to use BeautifulSoup to extract the product name and price from a website, but I don't know how to get just the content.
Python code:
from bs4 import BeautifulSoup
import re
div = '''<div pagetype="simple_table_nonFashion" class="itemBox"
id="itemSearchResultCon_679026"><p class="proPrice"><em class="num"
id="price0_679026" productid="679026" adproductflag="0" yhdprice="49.9"
productunit="" diapernum="0" diapernumunit=""><b>¥</b>49.90</em></p><p
class="proName clearfix"><a id="pdlink2_679026" pmid="0"
href="//item.yhd.com/679026.html"><style
type="text/css">.preSellOrAppoint {border: 1px solid #FFFFFF;}</style>印尼进口</a></p></div>'''
soup = BeautifulSoup(div, "lxml")
itemBox = soup.find("div", {"class": "itemBox"})
proPrice = itemBox.find("p", {"class": "proPrice"}).find("em").text
pdlink2 = itemBox.find('a', {"id": re.compile(r'^pdlink2_')}).text
print(proPrice)
print(pdlink2)
Printed result:
¥49.90
.preSellOrAppoint {border: 1px solid #FFFFFF;}印尼进口
My expected result is:
49.90
印尼进口
Here is code based on the BeautifulSoup object you provided:
from bs4 import BeautifulSoup
import re
div = '<div pagetype="simple_table_nonFashion" class="itemBox" id="itemSearchResultCon_679026"><p class="proPrice"><em class="num" id="price0_679026" productid="679026" adproductflag="0" yhdprice="49.9" productunit="" diapernum="0" diapernumunit=""><b>¥</b>49.90</em></p><p class="proName clearfix"><a id="pdlink2_679026" pmid="0" href="//item.yhd.com/679026.html"><style type="text/css">.preSellOrAppoint {border: 1px solid #FFFFFF;}</style>印尼进口</a></p></div>'
soup = BeautifulSoup(div, "lxml")
proPrice = soup.b.next_sibling
pdlink2 = soup.style.next_sibling
print(proPrice)
print(pdlink2)
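.next_sibling lets you access the text that follows the <b> and <style> tags, that is, the text node sitting right after each of them inside the same parent.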
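To make the sibling relationship concrete, here is a minimal sketch on a reduced fragment. The fragment, the use of "html.parser" instead of "lxml", and the helper name own_text are my own assumptions for illustration, not part of the question. It shows that <em> has two children (the <b> tag and a bare text node), and that collecting only the direct text nodes is another way to skip text that belongs to child tags.

```python
from bs4 import BeautifulSoup

# Reduced fragment for illustration only; not the full markup from the question.
html = '<em class="num"><b>¥</b>49.90</em>'
soup = BeautifulSoup(html, "html.parser")

em = soup.find("em")

# <em> has two children: the <b> tag and the text node "49.90".
children = list(em.children)

# The next sibling of <b> is the text node that follows it inside <em>.
price = soup.b.next_sibling


def own_text(tag):
    """Hypothetical helper: join only the text nodes directly inside `tag`,
    ignoring text that belongs to child tags such as <b> or <style>."""
    return "".join(tag.find_all(string=True, recursive=False)).strip()


print(price)
print(own_text(em))
```

Note that .next_sibling returns None when a tag is the last child of its parent, so on real pages it is worth checking the result before using it.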
Using the soup.select_one() method:
from bs4 import BeautifulSoup
div = '''<div pagetype="simple_table_nonFashion" class="itemBox"
id="itemSearchResultCon_679026"><p class="proPrice"><em class="num"
id="price0_679026" productid="679026" adproductflag="0" yhdprice="49.9"
productunit="" diapernum="0" diapernumunit=""><b>¥</b>49.90</em></p><p
class="proName clearfix"><a id="pdlink2_679026" pmid="0"
href="//item.yhd.com/679026.html"><style type="text/css">.preSellOrAppoint
{border: 1px solid #FFFFFF;}</style>印尼进口</a></p></div>'''
soup = BeautifulSoup(div, "lxml")
proPrice = soup.select_one("p.proPrice em").contents[-1]
pdlink2 = soup.select_one('p.proName > a').contents[-1]
print(proPrice)
print(pdlink2)
Output:
49.90
印尼进口
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
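If you run this over many product boxes, select_one() returns None whenever a selector does not match, so a guarded version may be safer. The following is only a sketch under assumptions: the function name extract_item, the shortened markup, and the choice of "html.parser" are mine, and I am assuming the yhdprice attribute mirrors the displayed price, which the question does not confirm.

```python
from decimal import Decimal

from bs4 import BeautifulSoup

# Shortened version of the markup from the question, for illustration only.
html = '''<div class="itemBox"><p class="proPrice"><em class="num"
yhdprice="49.9"><b>¥</b>49.90</em></p><p class="proName clearfix"><a
id="pdlink2_679026"><style type="text/css">.x {}</style>印尼进口</a></p></div>'''


def extract_item(markup):
    """Hypothetical helper: return the name and price, or None if either is missing."""
    soup = BeautifulSoup(markup, "html.parser")
    em = soup.select_one("p.proPrice em")
    link = soup.select_one("p.proName > a")
    if em is None or link is None:
        return None
    # contents[-1] is the trailing text node after <b> or <style>.
    price = Decimal(str(em.contents[-1]).strip())
    name = str(link.contents[-1]).strip()
    # Assumption: the yhdprice attribute carries the same price as the visible text.
    return {"name": name, "price": price, "attr_price": em.get("yhdprice")}


item = extract_item(html)
print(item["name"], item["price"], item["attr_price"])
```

Converting the price to Decimal rather than float avoids rounding surprises if the values are later summed or compared.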