How to extract the href attribute after a particular th in the wikipage infobox through Selenium or lxml using Python

Question

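My problem is getting the href from a specific cell in a wiki page's infobox (see the image below). Specifically, I want the href of 3M's official website, which sits in the cell after the "Website" table row header. The source code is highlighted in the image. (This page format is very regular across the wiki pages of most companies. I plan to collect the websites of many companies, not just this one.)

Things I tried that did not work: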

# selenium:
driver.find_element_by_xpath("//table[@class='infoboxvcard']/tr[th/text()='Website']").get_attribute("href") 
# lxml:
url = "https://en.wikipedia.org/wiki/3M"
req = requests.get(url)
store = etree.fromstring(req.text)
output = store.xpath("//table[@class='infobox vcard']/tr[th/text()='Website']/td")

Code that works for this specific company:

driver.get("https://en.wikipedia.org/wiki/3M")
website = driver.find_element_by_xpath("//*[@id='mw-content-text']/div/table[2]/tbody/tr[17]/td/span/a").get_attribute("href")

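However, since not all companies have the same number of rows in the infobox, this code will not work when I loop over hundreds of companies.

Any help would be appreciated! Thanks in advance!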

https://en.wikipedia.org/wiki/3M

Screenshot from the 3M wiki page:

Answer 1

Here is a more robust XPath:

website = driver.find_element_by_xpath('//*[@class="url"]/a').get_attribute("href")

If you know the link text, you can use:

website = driver.find_element_by_link_text('3M.com').get_attribute("href")

Hope this helps!

Answer 2

What you can do is store all the link texts in an Excel sheet, then read each string from the sheet and assign it to a variable, as I have done here for one example. Then the code below should work.

wb_link_text="3M.com"
wb_ele_href =driver.find_element_by_xpath("//a[text()[contains(.,'" + wb_link_text +"')]]").get_attribute("href")
print(wb_ele_href)

Let me know if this helps.
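The "one link text per company" idea can be sketched as a loop that builds one XPath per stored string. This is only an illustration under stated assumptions: the in-memory CSV `SHEET` is a hypothetical stand-in for the Excel sheet, `example.com` is a placeholder entry, and `link_xpath` is an invented helper name.

```python
import csv
import io

# Hypothetical stand-in for the spreadsheet: one link text per row.
SHEET = io.StringIO("link_text\n3M.com\nexample.com\n")


def link_xpath(link_text):
    """Build the XPath used in the answer for a single link text."""
    # Assumes link_text contains no single quote, which would break the string literal.
    return "//a[text()[contains(.,'" + link_text + "')]]"


xpaths = [link_xpath(row["link_text"]) for row in csv.DictReader(SHEET)]
for xp in xpaths:
    print(xp)
    # With a live driver, each XPath would be passed to the element lookup
    # and the href read with get_attribute("href").
```

This approach only works when the link text for every company is known in advance.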

Answer 3

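Selenium alone is enough to extract the href attribute of 3M's official website from Wikipedia. You need to induce WebDriverWait for the desired element to become visible, and you can use the following solution: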

website = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//th[@scope='row' and text()='Website']//following::td[1]/span/a[@class='external text']"))).get_attribute("href")

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
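The question also asks about lxml. A minimal sketch of the same header-relative lookup without a browser follows. It assumes the infobox markup resembles the simplified `SAMPLE_HTML` below, which is a hypothetical stand-in and not the real page source; `infobox_link` is an illustrative helper name. It parses with `lxml.html` rather than `etree.fromstring`, because the latter expects well-formed XML, and it matches rows with `//th` so that it does not matter whether a `tbody` sits between the table and its rows.

```python
from lxml import html

# Hypothetical, simplified stand-in for the infobox markup (not the real page source).
SAMPLE_HTML = """
<html><body>
<table class="infobox vcard">
  <tbody>
    <tr><th scope="row">Founded</th><td>1902</td></tr>
    <tr><th scope="row">Website</th>
        <td><span class="url"><a class="external text" href="https://www.3m.com/">3M.com</a></span></td></tr>
  </tbody>
</table>
</body></html>
"""


def infobox_link(page_source, label="Website"):
    """Return the first href in the cell that follows the row header `label`, or None."""
    tree = html.fromstring(page_source)
    hrefs = tree.xpath(
        "//table[contains(@class, 'infobox')]"
        "//th[normalize-space()=$label]"
        "/following-sibling::td[1]//a/@href",
        label=label,  # lxml binds keyword arguments to XPath variables
    )
    return hrefs[0] if hrefs else None


print(infobox_link(SAMPLE_HTML))
```

For a live page, the text fetched with `requests.get(url).text` would be passed in place of `SAMPLE_HTML`, provided the real markup matches these assumptions.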

selenium

xpath

lxml

python-3.x

webdriverwait