How to get the right XPath for this node? (Scrapy)
I'm new to web scraping with Scrapy, and I've been trying to get the right XPath for this part of the page.
HTML code
I've been using this Scrapy command:
response.xpath('//*[@id="companycontent"]/div/div/div[2]/div/div[6]/div').getall()
This is the output:
['<div class="address">\r\n <h4>Address <span>1</span></h4>\r\n <strong>Office : </strong>1715 , 1714<br>\r\n <strong>Floor : </strong>Floor 17<br>\r\n <strong>Building : </strong>Shatha Tower<br>\r\n Dubai Internet City<br><br>\r\n \t\t</div>']
response.xpath('//*[@id="companycontent"]/div/div/div[2]/div/div[6]/div').get()
'\r\n Address 1\r\n Office : 1715 , 1714
\r\n Floor : Floor 17
\r\n Building : Shatha Tower
\r\n Dubai Internet City
\r\n \t\t'
I also tried this:
response.xpath('//div[contains(@class, "address")]/text()').extract()
Output:
['\r\n \r\n \r\n \t\t\t\t\t\t\t\t ', '\r\n ', '\r\n ', '1715 , 1714', '\r\n ', 'Floor 17', '\r\n ', 'Shatha Tower', '\r\n Dubai Internet City', '\r\n \t\t', ' \r\n \t\t\r\n\r\n\t\t\t\t\t\t \r\n ']
response.xpath('//div[contains(@class, "address")]/text()').getall()
['\r\n \r\n \r\n \t\t\t\t\t\t\t\t ', '\r\n ', '\r\n ', '1715 , 1714', '\r\n ', 'Floor 17', '\r\n ', 'Shatha Tower', '\r\n Dubai Internet City', '\r\n \t\t', ' \r\n \t\t\r\n\r\n\t\t\t\t\t\t \r\n ']
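I'm sure the first command does the job, but I'd like to know whether there is a shorter XPath expression I can use in the script.
I hope someone can help me.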
To find an element by XPath, use the form //tag-name[@class="class-name"].
You can get the data with this approach.
Code:
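from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Raw string so the backslashes in the Windows path are not read as escapes
path = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(service=Service(path))  # Selenium 4 style
driver.get("https://tecomgroup.ae/directory/company.php?company=0016F00001wcgFJQAY&csrt=2648526569298119449")
# find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...)
data = driver.find_element(By.XPATH, '//div[@class="address"]')
data.text.split("\n")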
Output:
['ADDRESS 1',
'Office : 1715 , 1714',
'Floor : Floor 17',
'Building : Shatha Tower',
'Dubai Internet City']
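If you need the individual fields rather than display lines, the list above can be split on the " : " separator. This is only a minimal sketch: parse_address_lines is a hypothetical helper name, and it assumes every labelled line looks like "Label : value", as in the output shown.

```python
def parse_address_lines(lines):
    """Split 'Label : value' lines into a dict; collect unlabelled lines separately."""
    fields, extra = {}, []
    for line in lines:
        label, sep, value = line.partition(" : ")
        if sep:
            # The separator was found, so this is a labelled field
            fields[label.strip()] = value.strip()
        else:
            # No separator: heading or free-text line
            extra.append(line.strip())
    return fields, extra


# Sample input copied from the output above
lines = [
    'ADDRESS 1',
    'Office : 1715 , 1714',
    'Floor : Floor 17',
    'Building : Shatha Tower',
    'Dubai Internet City',
]
fields, extra = parse_address_lines(lines)
print(fields)
print(extra)
```

Lines without a separator (the heading and the area name) are kept in a second list instead of being dropped, so nothing from the page is lost.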
You can also use a CSS selector: response.css('div.address > div.address ::text')
print([x.strip() for x in response.css('div.address > div.address ::text').getall() if x.strip()])