Getting an attribute from a tag with Beautiful Soup
I am trying to get the 'datetime' attribute, but for the past few hours I have not been able to get it right:
import time
from bs4 import BeautifulSoup

# driver is assumed to be an already-created Selenium WebDriver instance
driver.get("https://cointelegraph.com/tags/bitcoin")
time.sleep(10)  # wait for the page to finish rendering
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
#print(soup.prettify())
articles = soup.find_all("article")
for article in articles:
    print("--------------------------------")
    if article.has_attr('datetime'):
        print(article['datetime'])
    else:
        print('no attribute present')
I ran this, but the attribute does not seem to be there:
--------------------------------
no attribute present
--------------------------------
no attribute present
--------------------------------
no attribute present
--------------------------------
no attribute present
--------------------------------
no attribute present
--------------------------------
no attribute present
--------------------------------
I inspected the HTML, and the 'datetime' attribute appears inside the 'article' tag. However, the tag itself seems to have only one attribute, 'class'.
<article class="post-card-inline" data-v-a5013924="">
<a class="post-card-inline__figure-link" href="/news/top-5-cryptocurrencies-to-watch-this-week-btc-xrp-link-bch-fil">
<figure class="post-card-inline__figure">
<div class="lazy-image post-card-inline__cover lazy-image_loaded">
<span class="pending lazy-image__pending pending_dark pending_finished">
<span class="pending__runner">
</span>
</span>
<!-- -->
<img alt="Top 5 cryptocurrencies to watch this week: BTC, XRP, LINK, BCH, FIL" class="lazy-image__img" pinger-seen="true" src="https://images.cointelegraph.com/images/370_aHR0cHM6Ly9zMy5jb2ludGVsZWdyYXBoLmNvbS91cGxvYWRzLzIwMjItMDQvYWJlMzJhMjYtMmMwMi00ODczLTllNGUtYWQ2ZTdmMzEzOGNlLmpwZw==.jpg" srcset="https://images.cointelegraph.com/images/370_aHR0cHM6Ly9zMy5jb2ludGVsZWdyYXBoLmNvbS91cGxvYWRzLzIwMjItMDQvYWJlMzJhMjYtMmMwMi00ODczLTllNGUtYWQ2ZTdmMzEzOGNlLmpwZw==.jpg
1x, https://images.cointelegraph.com/images/740_aHR0cHM6Ly9zMy5jb2ludGVsZWdyYXBoLmNvbS91cGxvYWRzLzIwMjItMDQvYWJlMzJhMjYtMmMwMi00ODczLTllNGUtYWQ2ZTdmMzEzOGNlLmpwZw==.jpg 2x"/>
</div>
<span class="post-card-inline__badge post-card-inline__badge_default">
Price Analysis
</span>
</figure>
</a>
<div class="post-card-inline__content">
<div class="post-card-inline__header">
<a class="post-card-inline__title-link" href="/news/top-5-cryptocurrencies-to-watch-this-week-btc-xrp-link-bch-fil">
<span class="post-card-inline__title">
Top 5 cryptocurrencies to watch this week: BTC, XRP, LINK, BCH, FIL
</span>
</a>
<div class="post-card-inline__meta">
<time class="post-card-inline__date" datetime="2022-04-17">
4 hours ago
</time>
<p class="post-card-inline_
...
You can get the 'datetime' attribute with bs4 and requests alone; selenium is not needed.
from bs4 import BeautifulSoup
import requests

headers = {"User-Agent": "Mozilla/5.0"}
url = 'https://cointelegraph.com/tags/bitcoin'
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
# The attribute lives on the nested <time> tag, not on <article>
for dt in soup.select('time.post-card-inline__date'):
    date_time = dt.get('datetime')
    print(date_time)
Output:
2022-04-17
2022-04-17
2022-04-17
2022-04-16
2022-04-16
2022-04-15
2022-04-15
2022-04-15
2022-04-15
2022-04-15
2022-04-15
2022-04-15
2022-04-15
2022-04-15
2022-04-14
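As a variation on the selector above, the following is a minimal offline sketch that matches on the attribute itself rather than on the class name, and turns each value into a date object. The SAMPLE_HTML string is a trimmed, made-up stand-in for the page markup (nothing is fetched from the live site), and the helper name extract_dates is hypothetical.

```python
from datetime import date

from bs4 import BeautifulSoup

# Trimmed stand-in for the page markup; not fetched from the live site.
SAMPLE_HTML = """
<article class="post-card-inline">
  <time class="post-card-inline__date" datetime="2022-04-17">4 hours ago</time>
</article>
<article class="post-card-inline">
  <time class="post-card-inline__date">no machine-readable date</time>
</article>
<article class="post-card-inline">
  <time class="post-card-inline__date" datetime="2022-04-15">2 days ago</time>
</article>
"""


def extract_dates(html):
    """Return a date object for every <time> tag that has a datetime attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # The CSS selector time[datetime] skips <time> tags without the attribute.
    return [date.fromisoformat(tag["datetime"]) for tag in soup.select("time[datetime]")]


dates = extract_dates(SAMPLE_HTML)
print(dates)
```

Selecting on the attribute means a tag without 'datetime' is skipped instead of yielding None, which dt.get('datetime') would return for it.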
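The problem is that 'datetime' is not an attribute of the 'article' tag. You need to look further down the tree for the tags that do carry it, using find_all(datetime=True); you can then read the value without any trouble.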
...
articles = soup.find_all("article")
for article in articles:
    # datetime=True matches any descendant tag that has a datetime attribute
    for time_tag in article.find_all(datetime=True):
        print(time_tag["datetime"])
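To make the difference concrete, here is a small self-contained sketch that can be run without a browser or network access. The SAMPLE_HTML string is a shortened, hypothetical stand-in for the markup in the question, and the helper name article_dates is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page markup; not fetched from the live site.
SAMPLE_HTML = """
<article class="post-card-inline">
  <div class="post-card-inline__meta">
    <time class="post-card-inline__date" datetime="2022-04-17">4 hours ago</time>
  </div>
</article>
<article class="post-card-inline">
  <div class="post-card-inline__meta">
    <time class="post-card-inline__date" datetime="2022-04-16">1 day ago</time>
  </div>
</article>
"""


def article_dates(html):
    """Collect the datetime values found inside each <article> tag."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for article in soup.find_all("article"):
        # Search the descendants of the article, not the article itself.
        for time_tag in article.find_all(datetime=True):
            found.append(time_tag["datetime"])
    return found


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_article = soup.find("article")
# has_attr only inspects the tag's own attributes, so this prints False.
print(first_article.has_attr("datetime"))
print(article_dates(SAMPLE_HTML))
```

has_attr checks only the attributes of the tag it is called on, which is why the loop in the question always took the 'no attribute present' branch even though a nested time tag carried the value.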