Python

Question

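I am writing a download script to automatically fetch PDFs from a website associated with my employer.

It looks like the PDFs are contained in a JQueryFileTree. Is there a way to download one of the folders below and save it to disk together with the PDFs it contains?

So far I am using Python and selenium to automate the login, etc.

Thanks

My code so far: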

from selenium import webdriver
from time import sleep 
import requests
from bs4 import BeautifulSoup as bs 

import secrets

class manual_grabber():
    """ A class creating a manual downloader for the Roger Technology website """
    def __init__(self):
        """ Initialize attributes of manual grabber """
        # Raw string so that backslashes in the Windows path (e.g. \U) are not parsed as escape sequences
        self.driver = webdriver.Chrome(r'\Users\Joel\Desktop\Python\manual_grabber\chromedriver.exe')

    def login(self):
        """ Function controlling the login logic """
        self.driver.get('urltosite')

        sleep(1)

        # Locate elements and enter login details
        user_in = self.driver.find_element_by_xpath('/html/body/div[2]/form/input[6]')
        user_in.send_keys(secrets.username)   

        pass_in = self.driver.find_element_by_xpath('/html/body/div[2]/form/input[7]')
        pass_in.send_keys(secrets.password)

        enter_button = self.driver.find_element_by_xpath('/html/body/div[2]/form/div/input')
        enter_button.click()
        
        # Click Self Service Area button
        self_service_button = self.driver.find_element_by_xpath('//*[@id="bs-example-navbar-collapse-1"]/ul/li[1]/a')
        self_service_button.click()


grab = manual_grabber()
grab.login()

The file structure looks like this:

Clicking one of these folders opens the PDFs it contains in a window to the right of the tree.

And the DOM:

Answer 1

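Based on the DOM and the screenshot, you can select the item from the jquery tree.

You can select it in the left-hand tree with this xpath:

//a[.//nobr[text()='Products catalogue and brochures']]

Breaking this xpath down:

//a is a relative a (anywhere on the page)
The opening [ means we are identifying it as...
. from this position (i.e. beneath the a), at any depth
//nobr any nobr tag
where text() = 'Products catalogue and brochures' (case sensitive)

In short, it is any a that has a nobr beneath it whose text is exactly the text we want.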
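To make the matching rule concrete, here is a minimal sketch that applies the same rule in plain Python to a toy document. The markup is hypothetical (it is not the real page), and the check is written out by hand with the standard library rather than evaluated by a browser, so treat it as an illustration of what the xpath selects, not as a replacement for it.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup loosely shaped like a jQuery file tree (not the real page).
SAMPLE = """
<ul>
  <li><a href="#"><span><nobr>Manuals</nobr></span></a></li>
  <li><a href="#"><span><nobr>Products catalogue and brochures</nobr></span></a></li>
  <li><a href="#">Products catalogue and brochures</a></li>
</ul>
"""

def anchors_with_nobr_text(root, label):
    """Return every <a> that has a <nobr> beneath it whose text equals label."""
    return [
        a for a in root.iter("a")
        if any(nobr.text == label for nobr in a.iter("nobr"))
    ]

root = ET.fromstring(SAMPLE)
matches = anchors_with_nobr_text(root, "Products catalogue and brochures")
print(len(matches))  # 1: the third <a> has the text but no <nobr>, so it is skipped
```

The comparison is exact, so a different case or a misspelling in the label matches nothing.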

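I put it in a simple test page and it matched exactly one element.

There are many more xpath identifiers and methods - this is a great learning resource.

With this xpath, you should only need to substitute the text of the item you want to click and it will do the rest.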
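If you end up clicking several folders, a small helper that builds the xpath from the label saves retyping it. This is only a sketch: the names tree_link_xpath and xpath_literal are mine, and the quoting helper is there because XPath 1.0 string literals have no escape character, so a label containing an apostrophe needs different quotes.

```python
def xpath_literal(text):
    """Quote text as an XPath 1.0 string literal, coping with embedded quotes."""
    if "'" not in text:
        return "'" + text + "'"
    if '"' not in text:
        return '"' + text + '"'
    # Both quote types are present: glue the pieces together with concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join("'" + p + "'" for p in parts) + ")"

def tree_link_xpath(label):
    """Build the xpath for the tree link whose <nobr> text is exactly label."""
    return "//a[.//nobr[text()=" + xpath_literal(label) + "]]"

print(tree_link_xpath("Products catalogue and brochures"))
```

The resulting string can be passed to whichever find call or wait condition you are already using.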

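If it doesn't find your element, let me know and I'll take another look.

Some additional thoughts on the first part of your question:

Depending on your application and how the tree works, you might need a wait strategy.

There are two main approaches, explicit and implicit. To keep it simple (this answer is already long enough), try an implicit wait first: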

driver.implicitly_wait(10)

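Set this ONCE in your __init__ and selenium will wait up to 10 seconds before complaining that an object doesn't exist. (Let me know if that doesn't work!)

You may also need to scroll the tree to bring the element into view. In that case, try this - identify your element with the xpath above and pass it in here: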

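from selenium.webdriver.common.action_chains import ActionChains

def ScrollIntoView(element):
    # Moving the mouse to the element scrolls it into view first
    actions = ActionChains(driver)
    actions.move_to_element(element).perform()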

The next part of your question is downloading the files, which means clicking the links on the right.
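Since you say the PDF opens in a window next to the tree, Chrome may be previewing it rather than saving it. One thing that might help is starting Chrome with download preferences. The sketch below only builds the preferences dictionary. The folder name "manuals" is a placeholder of mine, and I am assuming the site serves ordinary PDF responses that Chrome's viewer would otherwise display.

```python
import os

def pdf_download_prefs(download_dir):
    """Build Chrome preferences that save PDFs to download_dir instead of previewing them."""
    return {
        "download.default_directory": os.path.abspath(download_dir),
        "download.prompt_for_download": False,
        "plugins.always_open_pdf_externally": True,
    }

prefs = pdf_download_prefs("manuals")  # "manuals" is a hypothetical folder name

# Usage with selenium (not executed here, it needs Chrome and chromedriver):
#   options = webdriver.ChromeOptions()
#   options.add_experimental_option("prefs", prefs)
#   driver = webdriver.Chrome(options=options)
```

The download directory is made absolute because Chrome expects a full path there.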

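You will need to share more of the DOM - if you can share 2 or 3 of the a or img elements that start a download, I'll be able to give a tailored answer.

In general terms, if you want to get "all" the download links, you need to do something like this: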

import time

# Get all the elements - needs a condition that identifies just these anchors
allLinks = driver.find_elements_by_xpath('some condition')  # placeholder, replace with your real xpath

# Loop through all the links and click each one
for link in allLinks:
    link.click()
    time.sleep(3)  # don't download too many at once - depending on their size let them complete

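You can download files through selenium, but you cannot get the download size, progress or speed. Essentially, once a download starts you are blind to it.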
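One rough workaround for that blindness is to watch the download folder yourself. The sketch below assumes Chrome, which keeps in-progress downloads under a temporary name ending in .crdownload. The function name and the timeout values are my own choices.

```python
import os
import time

def wait_for_downloads(download_dir, timeout=60, poll=0.5):
    """Block until download_dir holds no partial Chrome downloads, or time out.

    Returns True if no *.crdownload file remains, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        partial = [n for n in os.listdir(download_dir) if n.endswith(".crdownload")]
        if not partial:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Note that this also returns True if the download has not started yet, so call it a moment after the click, or check for the file name you expect as well.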

There are alternatives, but they add complexity.
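One such alternative, sketched under assumptions I cannot verify from the DOM you posted: if each link has a plain href pointing at the PDF and the site authenticates with cookies, you can log in with selenium, copy its cookies, and fetch the files directly. The example uses only the standard library. The names cookie_header and fetch_pdf are mine.

```python
import urllib.request

def cookie_header(selenium_cookies):
    """Turn the list of dicts from driver.get_cookies() into a Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)

def fetch_pdf(url, dest_path, selenium_cookies, chunk_size=64 * 1024):
    """Stream url to dest_path, reusing the browser's session cookies.

    Returns the number of bytes written.
    """
    request = urllib.request.Request(
        url, headers={"Cookie": cookie_header(selenium_cookies)}
    )
    written = 0
    with urllib.request.urlopen(request) as response, open(dest_path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written

# Usage (not executed here, it needs a logged-in driver):
#   href = link.get_attribute("href")
#   fetch_pdf(href, "manual.pdf", driver.get_cookies())
```

Because the file is read in chunks, you can add a progress printout inside the loop, which is exactly what a click-driven download does not give you.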

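If you need more help, feel free to send me more of the DOM. Happy to take a look and help further.

Update based on the comments below. To work with an iframe, you need to identify it and switch to it before performing your actions. I can't test the xpath - so give it a try and adjust it as needed.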

frame = driver.find_element_by_xpath('//iframe[contains(@src,"ManageFiles")]')
driver.switch_to.frame(frame)
# do the actions on the frame
# when ready...
driver.switch_to.default_content()

Finally, make sure you switch back to the default content, or to the next iframe you need to interact with.
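If you are worried about forgetting the switch back, one option is to wrap it in a context manager so that it happens even when an action inside the frame raises. This is a sketch. The name inside_frame is mine, and it always returns to the default content rather than to a parent frame.

```python
from contextlib import contextmanager

@contextmanager
def inside_frame(driver, frame):
    """Switch into frame for the duration of the with-block, then switch back."""
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        # Runs on normal exit and on exceptions alike
        driver.switch_to.default_content()

# Usage (not executed here, it needs a live driver and the located iframe):
#   with inside_frame(driver, frame):
#       ...  # find and click elements inside the iframe
```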

Switch, find and click, with waits:

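from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait for the iframe to be available and switch to it
WebDriverWait(driver, 15).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,'//iframe[contains(@src,"ManageFiles")]')))

# Watch spelling and case on this line
myListObject = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH,"//a[.//nobr[text()='Products catalogue and brochures']]")))

myListObject.click()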

Python - Downloading PDF's contained in a JQueryFileTree

jquery

selenium

download

web-scraping