How to select elements between two h tags using XPath and Selenium
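I am trying to scrape this website.
Looking at its elements, though, I am confused about how to extract the data that sits between two h3 tags. Each h3 element contains a country name. Before the next h3 tag (which holds another country name), there are posts containing a deadline text and a link to the file related to that deadline. Each of them is in fact a business tender opportunity published for that country.
Inspect element:
After getting the data, this is how I will store it in the database.
Please note that we cannot predict what the country names and spellings in the h3 elements will be in the future, but we do not want to miss any opportunity newly published on the website.
Can anyone help me solve this with XPath or Selenium? Your help is greatly appreciated.
I have been trying to figure this out for the past few hours but could not come up with a good idea. Thank you in advance.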
To extract and print the texts, e.g. Sahel - Sécurité et Etat de Droit & Santé et Droits Sexuels et Reproductifs - OKP-SHL-20047 (210.78 kB) (in French), from the <p> element that follows each <h4>, using Selenium and Python you can use either of the following locator strategies:
Using css_selector:
from selenium.webdriver.common.by import By
driver.get("https://www.nuffic.nl/en/subjects/orange-knowledge-programme/calls-group-training-maximum-24-months-tmt-plus-orange")
# find_elements_by_css_selector() was removed in Selenium 4.3, so use find_elements() with By
print([my_elem.text for my_elem in driver.find_elements(By.CSS_SELECTOR, "h4 + p")])
Using xpath:
from selenium.webdriver.common.by import By
driver.get("https://www.nuffic.nl/en/subjects/orange-knowledge-programme/calls-group-training-maximum-24-months-tmt-plus-orange")
print([my_elem.text for my_elem in driver.find_elements(By.XPATH, "//h4//following::p[1]")])
Ideally you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get("https://www.nuffic.nl/en/subjects/orange-knowledge-programme/calls-group-training-maximum-24-months-tmt-plus-orange")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h4 +p")))])
Using XPATH:
driver.get("https://www.nuffic.nl/en/subjects/orange-knowledge-programme/calls-group-training-maximum-24-months-tmt-plus-orange")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h4//following::p[1]")))])
Console output:
['Sahel - Sécurité et Etat de Droit & Santé et Droits Sexuels et Reproductifs - OKP-SHL-20047 (210.78 kB) (in French)', 'Sahel - Santé et Droits Sexuels et Reproductifs - OKP-SHL-20048 (193.91 kB) (in French)', 'Sahel - Sécurité et Etat de Droit - OKP-SHL-20049 (209.17 kB) (in French)', 'Capacity strengthening for Resilience in the Sahel\u200c (365.57 kB)', 'Tunesia - Sécurité, Stabilité et Migration - OKP-TUN-40014 (209.87 kB) (in French)', 'Tunisia - Country Plan of Implementation - Orange Knowledge (423.8 kB)', 'Nigeria - Food and Nutrition Security - OKP-NIG-20050 (211.74 kB)', 'Sub-Sahara Africa - Health - OKP-SSA-50002 (214.27 kB)', 'Health Systems Strengthening through education and training – In Burkina Faso, Burundi, Ethiopia, Mali, Niger (301.71 kB)', 'Benin - Santé et Droits Sexuels et Reproductifs - OKP-BEN-20053 (189.51 kB) (in French)', 'Benin - Country Plan of Implementation - Orange Knowledge (in French)', 'Indonesia - Security and the Rule of Law - OKP-IDN-20055 (204.89 kB)', 'Indonesia - Food and Nutrition Security, Water - OKP-IDN-20056 (202.17 kB)', 'Indonesia - Country Plan of Implementation - Orange Knowledge (613.86 kB)', 'Horn of Africa – Food and Nutrition Security, Security and Rule of Law - OKP-EAR-20058 (236.34 kB)', 'Capacity strengthening for Resilience in the Horn of Africa (252.64 kB)\u200c', 'Thematic call SRHR - OKP-SRHR-40015 (235.72 kB)']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
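The flat list above loses track of which country each document belongs to. As a minimal sketch of one way to keep that association, the following walks the children of a container in document order and starts a new group at every h3, so no country name has to be known in advance. The markup in SAMPLE is a made-up fragment that only imitates the h3 / h4 / p layout described in the question, and group_by_heading is a hypothetical helper, not part of Selenium. It uses the standard-library xml.etree.ElementTree so it can be run without a browser.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating the h3 / h4 / p layout from the question.
SAMPLE = """
<div>
  <h3>Sahel</h3>
  <h4>Deadline 1 March</h4>
  <p><a href="/files/a.pdf">Call A</a></p>
  <h4>Deadline 2 April</h4>
  <p><a href="/files/b.pdf">Call B</a></p>
  <h3>Nigeria</h3>
  <h4>Deadline 3 May</h4>
  <p><a href="/files/c.pdf">Call C</a></p>
</div>
"""


def group_by_heading(container):
    """Group (deadline, link text, href) tuples under the preceding h3."""
    groups = {}
    country = None   # text of the most recent h3
    deadline = None  # text of the most recent h4
    for child in container:  # direct children, in document order
        if child.tag == "h3":
            country = "".join(child.itertext()).strip()
            groups.setdefault(country, [])
            deadline = None  # a new country resets the current deadline
        elif child.tag == "h4":
            deadline = "".join(child.itertext()).strip()
        elif child.tag == "p" and country is not None:
            for link in child.iter("a"):
                text = "".join(link.itertext()).strip()
                groups[country].append((deadline, text, link.get("href")))
    return groups


result = group_by_heading(ET.fromstring(SAMPLE.strip()))
for name, items in result.items():
    print(name, items)
```

With Selenium, the same loop could presumably run over container.find_elements(By.XPATH, "./*"), reading tag_name and text from each WebElement instead of tag and itertext().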
Try this code to get the required output:
from selenium.webdriver.common.by import By
# find_element(s)_by_xpath() was removed in Selenium 4.3, so use find_element(s)() with By
container = driver.find_element(By.XPATH, '//div[h3 and h4]')
headers = container.find_elements(By.XPATH, './h3')
data = []
for index, _header in enumerate(headers):
    header = _header.text
    # Deadline h4 siblings after this h3; not(...) drops those that already have index + 2 preceding h3 siblings, i.e. those past the next h3
    _deadlines = container.find_elements(By.XPATH, './h3[position()={}]/following-sibling::h4[starts-with(., "Deadline") and not(preceding-sibling::h3[position()={}])]'.format(index + 1, index + 2))
    _data = []
    for _deadline in _deadlines:
        link = _deadline.find_element(By.XPATH, './following-sibling::p[1]//a').get_attribute('href')
        _data.append({_deadline.text: link})
    data.append([header, _data])
for item in data:
    print(item)
Note: this gets only the first link for each deadline (even if there are several links).
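The question mentions storing the results in a database but does not show the schema, so the following is only a sketch under assumptions: it flattens a data list shaped like the one built above into rows and writes them to an in-memory SQLite table. The table name tenders, its columns, and the sample values are all hypothetical.

```python
import sqlite3

# Sample values in the same shape as `data` above: [header, [{deadline: link}, ...]].
# The countries, deadlines, and URLs here are invented for illustration.
data = [
    ["Sahel", [{"Deadline 1 March": "https://example.org/a.pdf"},
               {"Deadline 2 April": "https://example.org/b.pdf"}]],
    ["Nigeria", [{"Deadline 3 May": "https://example.org/c.pdf"}]],
]

# Flatten to one (country, deadline, link) row per document.
rows = [
    (country, deadline, link)
    for country, entries in data
    for entry in entries
    for deadline, link in entry.items()
]

conn = sqlite3.connect(":memory:")  # swap for a file path to persist
conn.execute(
    "CREATE TABLE tenders ("
    "country TEXT, deadline TEXT, link TEXT, "
    "UNIQUE (country, deadline, link))"
)
# INSERT OR IGNORE skips rows that are already stored, so repeating a
# scrape only adds newly published calls.
conn.executemany("INSERT OR IGNORE INTO tenders VALUES (?, ?, ?)", rows)
conn.executemany("INSERT OR IGNORE INTO tenders VALUES (?, ?, ?)", rows)  # simulated second run
conn.commit()

stored = conn.execute(
    "SELECT country, deadline, link FROM tenders ORDER BY rowid"
).fetchall()
print(stored)
```

Because the country is stored as plain text taken from whatever the h3 contains, new or differently spelled country names need no schema change.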