Web scraping with Selenium and Python
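I'm a coding beginner trying to learn web scraping with Selenium. I've been working on a project that runs through a dictionary and checks how long it would take to crack each word if it were used as a password.
My code reads a .txt file with one word per line, types each word into the input bar, and copies the estimated time needed to crack it.
The problem is that I can't capture one part of the page's HTML, and I need help.
This is my code: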
# This program runs through a Spanish dictionary and checks how secure each word is as a password
import random
import time
from selenium import webdriver
#Paste here Chromedriver path
CHROMEDRIVERPATH = r"C:\Program Files (x86)\chromedriver.exe"
#Paste here dictionary path in .txt format
dictionary = readFile("spanish_dictionary.txt")
date = str(time.strftime("%Y-%m-%dT%H-%M-%S"))
#read files
driver = webdriver.Chrome(CHROMEDRIVERPATH)
#webpage target
driver.get("https://www.security.org/how-secure-is-my-password/")
time.sleep(2)
#Label
writeFile("results_" + date + ".txt","word,time \n")
#File Content
for word in dictionary:
    bar = driver.find_element_by_id('password')
    bar.send_keys(word)
    bar.clear()
    timeToCrack = driver.find_element_by_xpath('//*[@id="hsimp"]/div[1]/div[3]/p[2]').get_attribute("class")
    result = word + "," + timeToCrack + "\n"
    writeFile("results_" + date + ".txt",result)
    time.sleep(random.uniform(0.4,1.0))
This is the page's HTML:
<p class="result__text result__time">2 hundred microseconds</p>
I get this in the output file:
word,time
a,result__text result__time
aba,result__text result__time
abaá,result__text result__time
I want this:
word,time
a,6 hundred picoseconds
aba,4 hundred nanoseconds
abaá,5 milliseconds
You want the element's text, not its class attribute:
timeToCrack = driver.find_element_by_xpath('//*[@id="hsimp"]/div[1]/div[3]/p[2]').text
The Java equivalent is:
driver.findElement(By.xpath("//*[@id='hsimp']/div[1]/div[3]/p[2]")).getText();
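The difference between the two calls can be seen without a browser. Here is a minimal sketch that uses Python's standard-library html.parser on the HTML line from the question. ParagraphInspector is a hypothetical helper written for illustration and is not part of Selenium. The class attribute corresponds to what get_attribute("class") returned, and the text content corresponds to what .text returns.

```python
from html.parser import HTMLParser

# The HTML line quoted in the question.
HTML = '<p class="result__text result__time">2 hundred microseconds</p>'


class ParagraphInspector(HTMLParser):
    """Hypothetical helper: records the class attribute and text of <p> elements."""

    def __init__(self):
        super().__init__()
        self.class_attr = None
        self.text = ""
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            # attrs is a list of (name, value) pairs
            self.class_attr = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.text += data


inspector = ParagraphInspector()
inspector.feed(HTML)
inspector.close()
print(inspector.class_attr)  # result__text result__time
print(inspector.text)        # 2 hundred microseconds
```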
To extract and print the result, you need to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR and get_attribute():
driver.get('https://www.security.org/how-secure-is-my-password/')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input#password"))).send_keys("lordkoda")
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.result p.result__text.result__time"))).get_attribute("innerHTML"))
Using XPATH and the text attribute:
driver.get('https://www.security.org/how-secure-is-my-password/')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@id='password']"))).send_keys("lordkoda")
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='result']//p[@class='result__text result__time']"))).text)
Console output:
5 seconds
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
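Putting the pieces together, the loop from the question might look roughly like the sketch below. It rests on several assumptions: read_words and scrape_crack_times are hypothetical names (the question's readFile and writeFile helpers are not shown), the selectors are the ones from the answer above and may no longer match the live page, and the estimate is read before the field is cleared on the guess that clearing resets it. It also uses find-by-locator calls and webdriver.Chrome() without a path, since recent Selenium 4 releases removed the find_element_by_* methods. The wait may return as soon as the previous result is still visible, so a stricter check could be needed in practice.

```python
import csv
import random
import time

URL = "https://www.security.org/how-secure-is-my-password/"


def read_words(path):
    """Return the non-empty lines of a UTF-8 text file, one word per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def scrape_crack_times(words, out_path, timeout=20):
    """Type each word into the password field and save the displayed estimate."""
    # Imported here so read_words stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Assumption: a recent Selenium 4 release that locates the driver binary itself.
    driver = webdriver.Chrome()
    try:
        driver.get(URL)
        wait = WebDriverWait(driver, timeout)
        with open(out_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["word", "time"])
            for word in words:
                bar = wait.until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, "input#password")))
                bar.send_keys(word)
                result = wait.until(EC.visibility_of_element_located(
                    (By.CSS_SELECTOR, "div.result p.result__text.result__time")))
                # Read the text first; assumption: clearing the field resets the estimate.
                writer.writerow([word, result.text])
                bar.clear()
                # Random pause between words, as in the question's loop.
                time.sleep(random.uniform(0.4, 1.0))
    finally:
        driver.quit()
```

A call such as scrape_crack_times(read_words("spanish_dictionary.txt"), "results.csv") would then write one word,time row per dictionary entry. The csv module is used instead of manual string concatenation so that words containing commas are quoted correctly.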