How to get the right XPath for this node? (Scrapy)

I am new to web scraping with Scrapy, and I have been trying to get the right XPath for this part of the HTML code from this website.

I have been using this Scrapy command:

response.xpath('//*[@id="companycontent"]/div/div/div[2]/div/div[6]/div').getall()

This is the output:

['<div class="address">\r\n                                    <h4>Address <span>1</span></h4>\r\n                                    <strong>Office : </strong>1715 , 1714<br>\r\n                                    <strong>Floor : </strong>Floor 17<br>\r\n                                    <strong>Building : </strong>Shatha Tower<br>\r\n                                    Dubai Internet City<br><br>\r\n                        \t\t</div>']


And also this one:

response.xpath('//div[contains(@class, "address")]/text()').extract()

Output:

['\r\n                        \r\n                            \r\n                        \t\t\t\t\t\t\t\t                                ', '\r\n                                    ', '\r\n                                    ', '1715 , 1714', '\r\n                                    ', 'Floor 17', '\r\n                                    ', 'Shatha Tower', '\r\n                                    Dubai Internet City', '\r\n                        \t\t', '        \r\n                        \t\t\r\n\r\n\t\t\t\t\t\t                            \r\n                    ']


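I am sure the first command does the job, but I would like to know whether there is a shorter XPath expression I can use in my script. I hope someone can help me.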

To find text by XPath, select the element by its class with an expression of the form //tag-name[@class="class-name"]. You can find the data with this approach.
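As a rough illustration of selecting by class rather than by position, the sketch below runs the same kind of expression with Python's standard-library ElementTree, which supports a subset of XPath. The SAMPLE markup is a hand-written, well-formed copy of the address block from the question (an assumption: `<br>` is written as `<br/>` so the XML parser accepts it). In Scrapy the expression would be passed to response.xpath instead.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed copy of the address block from the question
# (assumption: <br> is written as <br/> so the XML parser accepts it).
SAMPLE = """
<div id="companycontent">
  <div class="address">
    <h4>Address <span>1</span></h4>
    <strong>Office : </strong>1715 , 1714<br/>
    <strong>Floor : </strong>Floor 17<br/>
    <strong>Building : </strong>Shatha Tower<br/>
    Dubai Internet City<br/><br/>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)

# Select by class instead of by position in the tree.
# Scrapy counterpart: response.xpath('//div[@class="address"]')
address = root.find(".//div[@class='address']")

# itertext() walks every descendant text node, like '//text()' in XPath,
# so the labels inside <strong> are included, not only the direct text.
parts = [t.strip() for t in address.itertext() if t.strip()]
print(parts)
```

Note that an XPath ending in /text() returns only the direct child text nodes of the div, which is why the labels inside the strong tags are missing from the question's second output. Using //text() in Scrapy would include them.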

Code (using Selenium):

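from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Raw string so the backslashes in the Windows path are not read as escapes
path = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(service=Service(path))
driver.get("https://tecomgroup.ae/directory/company.php?company=0016F00001wcgFJQAY&csrt=2648526569298119449")
# Selenium 4 syntax; the older find_element_by_xpath helper is no longer available
data = driver.find_element(By.XPATH, '//div[@class="address"]')
print(data.text.split("\n"))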

Output:

['ADDRESS 1',
 'Office : 1715 , 1714',
 'Floor : Floor 17',
 'Building : Shatha Tower',
 'Dubai Internet City']
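If structured fields are more convenient than a list of lines, the output above could be split on the " : " separator. This is only a minimal sketch: the helper name parse_address_lines is hypothetical, and it assumes every labelled line uses exactly that separator.

```python
def parse_address_lines(lines):
    """Split 'Label : value' lines into a dict; keep other lines in order."""
    fields = {}
    other = []
    for line in lines:
        label, sep, value = line.partition(" : ")
        if sep:
            # Line had a 'Label : value' shape
            fields[label.strip()] = value.strip()
        else:
            # No separator found, e.g. the heading or the area name
            other.append(line.strip())
    return fields, other


# The list printed by the Selenium snippet above
lines = [
    'ADDRESS 1',
    'Office : 1715 , 1714',
    'Floor : Floor 17',
    'Building : Shatha Tower',
    'Dubai Internet City',
]

fields, other = parse_address_lines(lines)
print(fields)
print(other)
```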

You can also use a CSS selector: response.css('div.address > div.address ::text')

print([x.strip() for x in response.css('div.address > div.address ::text').getall() if x.strip()])
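To see what the strip-and-filter step does without a live response, here is a small sketch applied to a shortened copy of the getall() output from the question. Note the assumption: this list comes from the question's XPath with /text(), not from the CSS selector, so the labels are absent.

```python
# Shortened copy of the getall() output shown in the question
raw = [
    '\r\n        ',
    '\r\n    ',
    '1715 , 1714',
    '\r\n    ',
    'Floor 17',
    '\r\n    ',
    'Shatha Tower',
    '\r\n    Dubai Internet City',
    '\r\n    \t\t',
]

# Strip surrounding whitespace and drop entries that were whitespace only
cleaned = [x.strip() for x in raw if x.strip()]
print(cleaned)
```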