Is this website protected against scraping?

Question

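I am trying to scrape the directory (annuaire) from this website: http://www.presanse.fr/CISME/annuaire.aspx. To see the information I need to scrape, click "tous les services" and a list is displayed; then click an item (e.g. AST-BTP) and a page appears showing a lot of information (I need all of it). I inspected the page source and noticed that <div class="ficheCorneeDetails"> contains this information, but I can't scrape it: my script returns 'None'. Thanks for your help!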

Answer 1

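The information you want is loaded by JavaScript, so simply issuing an HTTP request from a scraper won't work: the HTML the server returns does not contain it yet.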

You need to use something like Selenium to simulate the click on the button.
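To illustrate why the script returns 'None', here is a minimal sketch that needs no network access and uses only the standard library. The two HTML strings are hypothetical stand-ins (not the site's real markup) for what the server sends before any script runs and for what the DOM might look like after the click; `ClassFinder` and `has_class` are names made up for this example.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any tag carries the given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.found = True


def has_class(html, css_class):
    """Return True if any tag in `html` has `css_class` among its classes."""
    finder = ClassFinder(css_class)
    finder.feed(html)
    return finder.found


# Hypothetical markup: what the server sends before any script runs...
initial_html = '<div id="content"><a id="linkTousLesServices">tous les services</a></div>'
# ...and what the DOM might look like after the click handler has run.
rendered_html = initial_html + '<div class="ficheCorneeDetails">AST-BTP ...</div>'

print(has_class(initial_html, "ficheCorneeDetails"))   # False
print(has_class(rendered_html, "ficheCorneeDetails"))  # True
```

A parser that only ever sees the first string has nothing to find, which is the situation a plain HTTP request puts the scraper in.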

Answer 2

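To do this, you will need to use Selenium in addition to Beautiful Soup.

1) Download geckodriver (for Firefox) here: https://github.com/mozilla/geckodriver/releases

2) Extract the exe and add it to your system path

3) Install selenium using pip install selenium (the script below also imports bs4 and uses the lxml parser, so the beautifulsoup4 and lxml packages must be installed as well)

4) Run the following: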

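from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Requires geckodriver on the system path (step 2)
driver = webdriver.Firefox()
driver.get('http://www.presanse.fr/CISME/annuaire.aspx')

# Click the "tous les services" link to load the list of services.
# find_element_by_id was removed in Selenium 4; use find_element(By.ID, ...)
availbutton = driver.find_element(By.ID, 'ctl00_cphMiddle_UC_RechercheParCarte1_linkTousLesServices')
availbutton.click()
time.sleep(2)  # give the table time to load

# Parse the rendered page, which now includes the JavaScript-generated content
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')

# Print the text of every result table on the page
for div in soup.find_all("div", {"class": "resultatTable"}):
    print(div.text)

driver.quit()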

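You can interact with the dynamically created elements in the same way as before, clicking DOM elements with button.click(). I added the 2-second delay to give the table time to load, because without it I was initially still getting blank results!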
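A fixed delay can be too short on a slow connection and wasteful on a fast one. One alternative is to poll until the content shows up. The sketch below is a generic, standard-library-only helper; `wait_until` and the simulated `table_rows` function are hypothetical names for illustration, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.25):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Stand-in for a page whose table only appears on the third check.
checks = {"count": 0}


def table_rows():
    checks["count"] += 1
    return ["AST-BTP"] if checks["count"] >= 3 else []


rows = wait_until(table_rows, timeout=5.0, poll_interval=0.01)
print(rows)  # ['AST-BTP']
```

With the driver from the script, the condition could be something like `lambda: driver.find_elements(By.CLASS_NAME, 'resultatTable')` (with `By` imported from `selenium.webdriver.common.by`), which returns an empty list until the table exists. Selenium also ships its own `WebDriverWait` class for this purpose.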

python

screen-scraping

beautifulsoup