Getting an attribute from a tag with Beautiful Soup

Question

I'm trying to get the 'datetime' attribute, but I haven't been able to get it right for the past few hours:

    import time
    from bs4 import BeautifulSoup

    # driver is an already-created selenium WebDriver instance
    driver.get("https://cointelegraph.com/tags/bitcoin")
    time.sleep(10)

    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')
    #print(soup.prettify())

    articles = soup.find_all("article")

    for article in articles:
        print("--------------------------------")
        if article.has_attr('datetime'):
            print(article['datetime'])
        else:
            print('no attribute present')

I ran this, but the attribute doesn't seem to be there:

--------------------------------
no attribute present
--------------------------------
no attribute present
--------------------------------
no attribute present
--------------------------------
no attribute present
--------------------------------
no attribute present
--------------------------------
no attribute present
--------------------------------

I inspected the HTML, and the 'datetime' attribute is inside the 'article' tag. But it looks like the tag itself has only one attribute, 'class'.

<article class="post-card-inline" data-v-a5013924="">
 <a class="post-card-inline__figure-link" href="/news/top-5-cryptocurrencies-to-watch-this-week-btc-xrp-link-bch-fil">
  <figure class="post-card-inline__figure">
   <div class="lazy-image post-card-inline__cover lazy-image_loaded">
    <span class="pending lazy-image__pending pending_dark pending_finished">
     <span class="pending__runner">
     </span>
    </span>
    <!-- -->
    <img alt="Top 5 cryptocurrencies to watch this week: BTC, XRP, LINK, BCH, FIL" class="lazy-image__img" pinger-seen="true" src="https://images.cointelegraph.com/images/370_aHR0cHM6Ly9zMy5jb2ludGVsZWdyYXBoLmNvbS91cGxvYWRzLzIwMjItMDQvYWJlMzJhMjYtMmMwMi00ODczLTllNGUtYWQ2ZTdmMzEzOGNlLmpwZw==.jpg" srcset="https://images.cointelegraph.com/images/370_aHR0cHM6Ly9zMy5jb2ludGVsZWdyYXBoLmNvbS91cGxvYWRzLzIwMjItMDQvYWJlMzJhMjYtMmMwMi00ODczLTllNGUtYWQ2ZTdmMzEzOGNlLmpwZw==.jpg
1x, https://images.cointelegraph.com/images/740_aHR0cHM6Ly9zMy5jb2ludGVsZWdyYXBoLmNvbS91cGxvYWRzLzIwMjItMDQvYWJlMzJhMjYtMmMwMi00ODczLTllNGUtYWQ2ZTdmMzEzOGNlLmpwZw==.jpg 2x"/>
   </div>
   <span class="post-card-inline__badge post-card-inline__badge_default">
    Price Analysis
   </span>
  </figure>
 </a>
 <div class="post-card-inline__content">
  <div class="post-card-inline__header">
   <a class="post-card-inline__title-link" href="/news/top-5-cryptocurrencies-to-watch-this-week-btc-xrp-link-bch-fil">
    <span class="post-card-inline__title">
     Top 5 cryptocurrencies to watch this week: BTC, XRP, LINK, BCH, FIL
    </span>
   </a>
   <div class="post-card-inline__meta">
    <time class="post-card-inline__date" datetime="2022-04-17">
     4 hours ago
    </time>
    <p class="post-card-inline_
    ...

Answer 1

You can get the 'datetime' attribute using bs4 and requests, without selenium.

from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent header with the request
headers = {"User-Agent": "Mozilla/5.0"}
url = 'https://cointelegraph.com/tags/bitcoin'

req = requests.get(url, headers=headers)

soup = BeautifulSoup(req.content, 'html.parser')

# The attribute sits on the nested <time> tag, so select those tags directly
for dt in soup.select('time.post-card-inline__date'):
    date_time = dt.get('datetime')
    print(date_time)

This prints the value of each 'datetime' attribute, one per line.
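If you want to check the selector without hitting the site, here is a minimal offline sketch. The SAMPLE_HTML string is a trimmed-down stand-in modeled on the markup in the question, and extract_dates is a hypothetical helper name, not part of bs4. It also adds [datetime] to the selector, which is my own tweak so that 'time' tags lacking the attribute are skipped.

```python
from bs4 import BeautifulSoup

# Assumption: a hand-written sample modeled on the question's markup, not live data.
SAMPLE_HTML = """
<article class="post-card-inline">
 <div class="post-card-inline__meta">
  <time class="post-card-inline__date" datetime="2022-04-17">4 hours ago</time>
 </div>
</article>
<article class="post-card-inline">
 <div class="post-card-inline__meta">
  <time class="post-card-inline__date">no attribute here</time>
 </div>
</article>
"""


def extract_dates(html):
    """Return the datetime value of every matching <time> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # [datetime] restricts the match to tags that actually carry the attribute.
    selector = "time.post-card-inline__date[datetime]"
    return [tag["datetime"] for tag in soup.select(selector)]


if __name__ == "__main__":
    for value in extract_dates(SAMPLE_HTML):
        print(value)
```

Without the [datetime] part, dt.get('datetime') would simply return None for a tag that has no such attribute.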

Answer 2

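The problem is that datetime is not an attribute of the article tag. You need to search further down to find the tags that do have that attribute, using find_all(datetime=True), and then you can access its value without any problem.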

...
articles = soup.find_all("article")

for article in articles:
    # find_all(datetime=True) matches any descendant tag that has a datetime attribute
    for time_tag in article.find_all(datetime=True):
        print(time_tag['datetime'])
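To see why the original has_attr check fails while the nested search works, here is a small self-contained sketch. ARTICLE_HTML is an assumed sample shaped like the question's markup, and dates_per_article is a hypothetical helper name used only for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a hand-written sample shaped like the question's markup.
ARTICLE_HTML = """
<article class="post-card-inline">
 <div class="post-card-inline__meta">
  <time class="post-card-inline__date" datetime="2022-04-17">4 hours ago</time>
 </div>
</article>
<article class="post-card-inline">
 <div class="post-card-inline__meta">
  <p>no time tag in this one</p>
 </div>
</article>
"""


def dates_per_article(html):
    """For each <article>, report whether it has the attribute itself
    and which datetime values its descendants carry."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for article in soup.find_all("article"):
        # has_attr only looks at the <article> tag's own attributes.
        on_article = article.has_attr("datetime")
        # find_all searches descendants, so it reaches the nested <time> tag.
        nested = [tag["datetime"] for tag in article.find_all(datetime=True)]
        results.append((on_article, nested))
    return results


if __name__ == "__main__":
    for on_article, nested in dates_per_article(ARTICLE_HTML):
        print(on_article, nested)
```

has_attr checks only the tag it is called on, which is why every article in the question printed 'no attribute present' even though a 'time' tag deeper in the tree had the attribute.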
