How to make our web-scraping script check both scenarios but execute only the one needed
I scraped some data from a website, and here is my script:
import warnings
warnings.filterwarnings("ignore")
import re
import requests
from requests import get
from bs4 import BeautifulSoup
import os
import pandas as pd
import numpy as np
import shutil
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time  # needed for the time.sleep() calls below
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
'Referer': 'https://www.espncricinfo.com/',
'Upgrade-Insecure-Requests': '1',
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache',
}
PATH = "driver\chromedriver.exe"
options = webdriver.ChromeOptions()
options.add_argument("--disable-gpu")
#options.add_argument('enable-logging')
options.add_argument("start-maximized")
#options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options, executable_path=PATH)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
url = 'https://www.boursorama.com/'
driver.get(url)
# Accept the cookie banner if it shows up; the wait itself can time out,
# so it belongs inside the try as well
try:
    cookie = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="didomi-notice-agree-button"]')))
    cookie.click()
except:
    pass
df = pd.read_excel('liste.xlsx')
df2 = pd.DataFrame(df)
df3 = df2['Entreprises'].values.tolist()
currencies = []
for i in df3:
    try:
        print(i)
        searchbar = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, 'html/body/div[6]/div[3]/div[2]/ol/li[1]/button')))
        searchbar.click()
        searchbar2 = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[6]/div[1]/div[2]/form/div/input')))
        searchbar2.click()
        searchbar2.send_keys(i + '\n')
        time.sleep(2)
        links = driver.find_elements_by_xpath('//*[@id="main-content"]/div/div/div[4]/div[1]/div[3]/div/div/div[2]/div[1]/div/div[3]/div/div[1]/div/table/tbody/tr[1]/td[1]/div/div[2]/a')
        for k in links:
            data = k.get_attribute("href")
            results = requests.get(data)
            soup = BeautifulSoup(results.text, "html.parser")
            currency = soup.find('span', class_='c-instrument c-instrument--last').text
            currencies.append(currency)
    except:
        print(i)
        searchbar = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, 'html/body/div[6]/div[3]/div[2]/ol/li[1]/button')))
        searchbar.click()
        searchbar2 = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[6]/div[1]/div[2]/form/div/input')))
        searchbar2.click()
        searchbar2.send_keys(i + '\n')
        time.sleep(2)
        url2 = driver.current_url
        results = requests.get(url2)
        soup = BeautifulSoup(results.text, "html.parser")
        currency = soup.find('span', class_='c-instrument c-instrument--last').text
        currencies.append(currency)
print(currencies)
liste.xlsx is just an Excel file containing the list of company names my loop iterates over.
Here is my output:
TotalEnergies
TotalEnergies
Engie
Engie
BNP
BNP
['45.59', '11.07', '49.03']
I don't understand: my script seems to execute both the try and the except. I get the 3 results I expected, but each company name is printed twice. My goal is: execute the try when it applies, otherwise execute the except. Can I improve my code so that it runs only one of the two, the one that is needed?
The reason is that sometimes, when you search for a company, you need to be more specific and the site offers you several alternatives, which is what this code handles:
try:
    print(i)
    searchbar = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, 'html/body/div[6]/div[3]/div[2]/ol/li[1]/button')))
    searchbar.click()
    searchbar2 = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[6]/div[1]/div[2]/form/div/input')))
    searchbar2.click()
    searchbar2.send_keys(i + '\n')
    time.sleep(2)
    links = driver.find_elements_by_xpath('//*[@id="main-content"]/div/div/div[4]/div[1]/div[3]/div/div/div[2]/div[1]/div/div[3]/div/div[1]/div/table/tbody/tr[1]/td[1]/div/div[2]/a')
    for k in links:
        data = k.get_attribute("href")
        results = requests.get(data)
        soup = BeautifulSoup(results.text, "html.parser")
        currency = soup.find('span', class_='c-instrument c-instrument--last').text
        currencies.append(currency)
And sometimes you type the exact name into the search bar and the site goes straight to the page you want, hence this code:
except:
    print(i)
    searchbar = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, 'html/body/div[6]/div[3]/div[2]/ol/li[1]/button')))
    searchbar.click()
    searchbar2 = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[6]/div[1]/div[2]/form/div/input')))
    searchbar2.click()
    searchbar2.send_keys(i + '\n')
    time.sleep(2)
    url2 = driver.current_url
    results = requests.get(url2)
    soup = BeautifulSoup(results.text, "html.parser")
    currency = soup.find('span', class_='c-instrument c-instrument--last').text
    currencies.append(currency)
But how can I make the script check both scenarios yet execute only the one that is needed, and improve the run time?
"My goal is: execute the try when it applies, otherwise execute the except."
That is exactly what it is doing. I would suggest learning how to debug code: run it line by line, follow the logic, and you will see what is happening.
When you use try/except, Python "tries" to run the code in the try block. If that succeeds, the except block is skipped. If it fails at some point inside the try block, execution jumps to the except code instead.
The reason it seems to run both is that, technically, as described above, it does run both. You see each name printed twice because of where the print() statements sit: it enters the try block and prints i with the print(i) at the top; at some point after that print(i) an error/exception is raised inside the try block, so execution jumps to the except block, and the print(i) at the top of that block prints i again.
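You can reproduce the same double print with a five-line toy example (my illustration, not your scraper):

try:
    print('hello')    # runs first
    raise ValueError  # stands in for the failure inside your try block
except:
    print('hello')    # runs as well, so 'hello' appears twice

Anything duplicated between the two blocks, like your print(i), therefore happens twice whenever the try block fails.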
If you want the script to check a condition and execute only the branch you want, you need an if block that tests that condition, rather than try/except.
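As a minimal sketch of that idea (my code, untested against the live site): find_elements_by_xpath returns an empty list when nothing matches, so the result list itself can serve as the condition that tells the two scenarios apart. A drop-in replacement for the body of your loop might look like:

# ...open the search bar and focus the input exactly as in your current code...
searchbar2.send_keys(i + '\n')
time.sleep(2)
# XPath copied from the question's try block (first link in the results table)
links = driver.find_elements_by_xpath('//*[@id="main-content"]/div/div/div[4]/div[1]/div[3]/div/div/div[2]/div[1]/div/div[3]/div/div[1]/div/table/tbody/tr[1]/td[1]/div/div[2]/a')
if links:
    # Scenario 1: the site showed a list of alternatives; follow the first hit
    page_url = links[0].get_attribute('href')
else:
    # Scenario 2: the site went straight to the quote page
    page_url = driver.current_url
soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
currencies.append(soup.find('span', class_='c-instrument c-instrument--last').text)

This checks the page once and runs exactly one branch, instead of running the try block until it fails and then repeating all the typing in the except block.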
That said, it is far more efficient to pull the data from the source than to render the page with Selenium, and you get more data as well. I don't know what you are hoping for from the response, but this is what you get: click here
Code:
import requests
from bs4 import BeautifulSoup

df3 = ['TotalEnergies', 'Engie', 'BNP']
currencies = []
for i in df3:
    # Search endpoint: the first result link contains the instrument symbol
    url = f'https://www.boursorama.com/recherche/ajax?query={i}&searchId='
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    symbol = soup.find('a', {'class': 'search__list-link'})['href'].split('/')[-2]

    # End-of-day ticks endpoint returns the quote data as JSON
    url = 'https://www.boursorama.com/bourse/action/graph/ws/GetTicksEOD'
    payload = {
        'symbol': symbol,
        'length': '1',
        'period': '0',
        'guid': ''}

    jsonData = requests.get(url, params=payload).json()
    data = jsonData['d']
    name = data['Name']
    qd = data['qd']['c']

    currencies.append(qd)
    print(f'{name}: {qd}')

print(currencies)
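One caveat (my addition, not part of the original answer): if a query returns no match, soup.find() returns None and the ['href'] lookup raises a TypeError. A small guard in the same if-based spirit as above:

link = soup.find('a', {'class': 'search__list-link'})
if link is None:
    print(f'{i}: no search result, skipping')
    continue
symbol = link['href'].split('/')[-2]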
Output:
TOTALENERGIES: 45.59
ENGIE: 11.07
BNP PARIBAS: 49.03
[45.59, 11.07, 49.03]