How to extract the time and title of each of the 7 main news items on https://tengrinews.kz using Selenium and Python
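I need to scrape the 7 main news items from this website, tengrinews.kz: the date, time, and title of each one. I am using Selenium and have Firefox Developer Edition installed.
I inspected the site, and the 7 news items sit in this structure: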
<body>
<header> ... some stuff </header>
<main>
<div class="tn-main-news-grid">
<div class="tn-main-news-item firs-column tn-three-column tn-background-cover">
<span class="tn-main-news-title" style="z-index: 1;">BIG MAJOR NEWS TEXT</span>
<a href="/kazakhstan_news/major-news/" class="tn-link"><span class="tn-hidden">BIG MAJOR NEWS TEXT</span></a>
</div>
<div class="tn-main-news-item">
<span class="tn-main-news-title">news1 TEXT</span>
<a href="/kazakhstan_news/news1/" class="tn-link">
<span class="tn-hidden">news1 TEXT</span></a>
</div>
<div class="tn-main-news-item">
<span class="tn-main-news-title">news2 TEXT</span>
<a href="/kazakhstan_news/news2/" class="tn-link">
<span class="tn-hidden">news2 TEXT</span></a>
</div>
<div class="tn-main-news-item">
<span class="tn-main-news-title">news3 TEXT</span>
<a href="/kazakhstan_news/news3/" class="tn-link">
<span class="tn-hidden">news3 TEXT</span></a>
</div>
</div>
</main>
</body>
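Using XPath or a CSS selector, I located the div that wraps all 7 news items. I do get a Firefox web element back, but it is a list, and it is empty!
If I try to locate a single href or div, it returns web elements of type 'list'. According to the Selenium docs this href should have a text attribute, but I get the error "has no attribute text".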
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("https://tengrinews.kz")
css_to_big_news = 'html body div.my-app main section.tn-main-section.tn-container div.tn-main-news-container.tn-sub-container div.tn-main-news-grid div.tn-main-news-item.firs-column.tn-three-column.tn-background-cover a.tn-link'
href_big = driver.find_elements_by_css_selector(css_to_big_news)
print('type of href_big is %s and length is %d' %(type(href_big), len(href_big)))
print(href_big[0].text) #this is wrong
print(href_big.text()) # this is wrong with parenthesis
What is wrong?
To extract the text, e.g. TEXT, from each <span> using Selenium and Python, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get("https://tengrinews.kz/")
print("Date and Time:")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.tn-main-news-grid div.tn-main-news-item ul.tn-data-list>li>span time")))])
print("Title:")
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.tn-main-news-grid div.tn-main-news-item span.tn-main-news-title")))])
Using XPATH:
driver.get("https://tengrinews.kz/")
print("Date and Time:")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='tn-main-news-grid ']//div[contains(@class, 'tn-main-news-item')]//ul[@class='tn-data-list']/li/span//time")))])
print("Title:")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='tn-main-news-grid ']//div[contains(@class, 'tn-main-news-item')]//span[@class='tn-main-news-title']")))])
Console output:
Date and Time:
['вчера, 18:27', 'вчера, 21:45', 'вчера, 20:52', 'вчера, 19:48', 'вчера, 17:34', 'вчера, 14:50', 'вчера, 14:32']
Title:
['Жара до 42 градусов ожидается в регионах Казахстана', 'Строгий карантин вводят в Мангистауской области', 'Нехватку вакцин и новую "суровую" волну COVID-19 предрекли в мире', 'Столицу Казахстана "оживили"', 'Жители Актау собрались на площади из-за отсутствия лекарств в аптеках', 'Строгий карантин в Нур-Султане продлили до 2 августа', '"Едят антибиотики". Врач из Павлодара объяснил рост числа тяжелых больных']
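The two lists above line up index by index, so they can be zipped into (time, title) pairs. The following is a minimal sketch, not part of the original answer: it also turns a relative label such as 'вчера, 18:27' ("yesterday, 18:27") into a concrete datetime. The helper names `parse_relative_time` and `pair_news` are hypothetical, and the label 'сегодня' ("today") is an assumption, since only 'вчера' appears in the output above. Any label the sketch does not recognise is returned as None rather than guessed.

```python
from datetime import date, datetime, timedelta

# Relative day labels mapped to an offset in days. "вчера" (yesterday) appears
# in the console output above; "сегодня" (today) is an assumed extra label.
RELATIVE_DAYS = {"сегодня": 0, "вчера": -1}


def parse_relative_time(label, today):
    """Turn a label such as 'вчера, 18:27' into a datetime, or None if unknown."""
    day_word, _, clock = label.partition(",")
    offset = RELATIVE_DAYS.get(day_word.strip().lower())
    if offset is None:
        return None  # e.g. an absolute date, which this sketch does not handle
    try:
        hhmm = datetime.strptime(clock.strip(), "%H:%M").time()
    except ValueError:
        return None  # missing or malformed clock part
    return datetime.combine(today + timedelta(days=offset), hhmm)


def pair_news(times, titles, today):
    """Zip the two scraped lists into (datetime or None, title) tuples."""
    return [(parse_relative_time(t, today), title) for t, title in zip(times, titles)]


# First two entries of each list from the console output above.
times = ["вчера, 18:27", "вчера, 21:45"]
titles = [
    "Жара до 42 градусов ожидается в регионах Казахстана",
    "Строгий карантин вводят в Мангистауской области",
]
for when, title in pair_news(times, titles, today=date.today()):
    print(when, title)
```

Passing `today` explicitly keeps the conversion reproducible. Note that `zip` stops at the shorter list, so a mismatch in list lengths would silently drop items.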
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
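An empty list like the one in the question is consistent with the page being queried before the news items were rendered, which is why the snippets above wrap each lookup in WebDriverWait. Conceptually, `WebDriverWait(driver, 20).until(condition)` calls the condition repeatedly (every 0.5 seconds by default) until it returns something truthy, and raises a timeout error otherwise. The sketch below imitates that polling loop in plain Python so it runs without a browser. It is an illustration of the idea, not Selenium's implementation, and `wait_until` and `find_titles` are hypothetical names.

```python
import time


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Conceptual stand-in for WebDriverWait(driver, timeout).until(...);
    this is not Selenium's own code.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Hypothetical stand-in for a page whose news items only show up on the
# third lookup; the first two lookups return an empty list.
attempts = {"count": 0}


def find_titles():
    attempts["count"] += 1
    return ["news1 TEXT", "news2 TEXT"] if attempts["count"] >= 3 else []


print(wait_until(find_titles, timeout=2.0, poll=0.01))
```

A single immediate lookup corresponds to the first call of `find_titles`, which returns an empty list; the polling loop keeps trying until the items exist.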
Outro
Links to useful documentation:
- get_attribute() method: gets the given attribute or property of the element.
- text attribute: returns the text of the element.
- Difference between text and innerHTML using Selenium
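One practical consequence of that difference: `text` gives the rendered, visible text, while `get_attribute("innerHTML")` gives the raw markup inside the element, which may contain nested tags, character entities, and stray whitespace. If the innerHTML route is used for titles, a small clean-up step can help. The following is a minimal sketch using only the standard library; `inner_html_to_text` is a hypothetical helper, and the sample string is made up for illustration.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &quot; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def inner_html_to_text(raw):
    """Reduce an innerHTML string to plain text with collapsed whitespace."""
    collector = _TextCollector()
    collector.feed(raw)
    collector.close()  # flush any text still buffered by the parser
    return " ".join("".join(collector.parts).split())


# Made-up innerHTML value with an entity, a nested tag and extra whitespace.
sample = '  Столицу Казахстана &quot;оживили&quot; <b>сегодня</b>\n'
print(inner_html_to_text(sample))
```

This does not replicate what `text` does in the browser (it knows nothing about CSS visibility); it only strips markup from a string that has already been fetched.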