Extracting a single attribute from multiple attributes within an anchor tag using BeautifulSoup
As the title says, here is my code:
import bs4
import re
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://finance.yahoo.com/screener/predefined/undervalued_growth_stocks"
loadpage = uReq(my_url)
showloadpage = loadpage.read()
loadpage.close()
soupit = soup(showloadpage, "html.parser")
#the regex works - returns all of the "tr" tags, i.e. containers
containers = soupit.findAll("tr", {"class" : re.compile("data-row.*")})
for container in containers:
    print(container.findAll("a", {"class" : "Fw(b)"}))
The result I get is:
[<a class="Fw(b)" data-reactid="69" data-symbol="AMAT" href="/quote/AMAT?p=AMAT">AMAT</a>]
[<a class="Fw(b)" data-reactid="99" data-symbol="MS" href="/quote/MS?p=MS">MS</a>]
[<a class="Fw(b)" data-reactid="129" data-symbol="NLY" href="/quote/NLY?p=NLY">NLY</a>]
[<a class="Fw(b)" data-reactid="159" data-symbol="ODP" href="/quote/ODP?p=ODP">ODP</a>]
[<a class="Fw(b)" data-reactid="189" data-symbol="FCAU" href="/quote/FCAU?p=FCAU">FCAU</a>]
[<a class="Fw(b)" data-reactid="219" data-symbol="RDC" href="/quote/RDC?p=RDC">RDC</a>]
[<a class="Fw(b)" data-reactid="249" data-symbol="ING" href="/quote/ING?p=ING">ING</a>]
[<a class="Fw(b)" data-reactid="279" data-symbol="FTI" href="/quote/FTI?p=FTI">FTI</a>]
[<a class="Fw(b)" data-reactid="309" data-symbol="BX" href="/quote/BX?p=BX">BX</a>]
[<a class="Fw(b)" data-reactid="339" data-symbol="FNSR" href="/quote/FNSR?p=FNSR">FNSR</a>]
What I want to get now is the data-symbol attribute, but eventually I also want the href. I have tried a few different approaches so far with no luck. Any help would be much appreciated.
Just loop over the links you found for each container and grab the values you want. Each link tag supports dictionary-style access, so if you don't want to use the get() method you can fetch a value by using the attribute name as the key, e.g. link["href"].
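To illustrate the difference between the two access styles, here is a minimal sketch using a single anchor tag copied from the output above (the data-missing attribute is a made-up name used only to show the default-value behaviour):

```python
from bs4 import BeautifulSoup

# One of the anchor tags from the question's output.
html = '<a class="Fw(b)" data-symbol="AMAT" href="/quote/AMAT?p=AMAT">AMAT</a>'
tag = BeautifulSoup(html, "html.parser").a

# Dictionary-style access raises KeyError if the attribute is missing.
symbol = tag["data-symbol"]

# .get() returns None (or a supplied default) instead of raising.
href = tag.get("href")
missing = tag.get("data-missing", "n/a")  # hypothetical attribute, absent here

print(symbol, href, missing)
```

Use .get() when an attribute might be absent from some tags; use the key syntax when a missing attribute should be treated as an error.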
Also, try the lxml parser; it is much faster than html.parser.
import re
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://finance.yahoo.com/screener/predefined/undervalued_growth_stocks"
loadpage = uReq(my_url)
showloadpage = loadpage.read()
loadpage.close()
soupit = soup(showloadpage, "lxml")
#the regex works - returns all of the "tr" tags, i.e. containers
containers = soupit.find_all("tr", {"class" : re.compile("data-row.*")})
for container in containers:
    links = container.find_all("a", {"class" : "Fw(b)"})
    for link in links:
        data_symbol = link.get("data-symbol")
        href = link.get("href")
        print(data_symbol, href)