Handling timeout with Selenium and Python

Question

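Can anyone help me with this? I have written code using Selenium to scrape articles from a Chinese news site. Because many of the URLs do not load, I added code to catch the timeout exception. That works, but the browser then seems to stay on the page that timed out while loading instead of trying the next URL.

I tried adding driver.quit() and driver.close() after handling the error, but then the next iteration of the loop does not work.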

with open('url_list_XB.txt', 'r') as f:
    url_list = f.readlines()

for idx, url in enumerate(url_list):
    status = str(idx)+" "+str(url)
    print(status)

    try:
        driver.get(url)
        try:
            tblnks = driver.find_elements_by_class_name("post_topshare_wrap")
            for a in tblnks:
                html = a.get_attribute('innerHTML')
                try:
                    link = re.findall('href="http://comment(.+?)" title', str(html))[0]
                    tb_link = 'http://comment' + link
                    print(tb_link)
                    ID = tb_link.replace("http://comment.tie.163.com/","").replace(".html","")
                    print(ID)
                    with open('tb_links.txt', 'a') as p:
                        p.write(tb_link + '\n')
                    try:
                        text = str(driver.find_element_by_class_name("post_text").text)
                        headline = driver.find_element_by_tag_name('h1').text
                        date = driver.find_elements_by_class_name("post_time_source")
                        for a in date:
                            date = str(a.text)
                            dt = date.split("　来源")[0]
                            dt2 = dt.replace(":", "_").replace("-", "_").replace(" ", "_")

                        count = driver.find_element_by_class_name("post_tie_top").text

                        with open('SCS_DATA/' + dt2 + '_' + ID + '_INT_' + count + '_WY.txt', 'w') as d:
                            d.write(headline)
                            d.write(text + '\n')
                        path = 'SCS_DATA/' + ID
                        os.mkdir(path)

                    except NoSuchElementException as exception:
                        print("Element not found ")
                except IndexError as g:
                    print("Index Error")


            node = [url, tb_link]
            results.append(node)

        except NoSuchElementException as exception:
            print("TB link not found ")
        continue


    except TimeoutException as ex:
        print("Page load time out")

    except WebDriverException:
        print('WD Exception')

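I want the code to work through the list of URLs, load each one, and scrape the article text together with the link to the discussion page. It works until a page load times out, and then the program does not continue.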

Answer 1

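I can't fully understand what your code is doing, because I don't have the context of the page you are automating, but I can offer a general structure for how you would accomplish something like this. Here is a simplified version of how I would handle your situation: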

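from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# iterate URL list (driver and url_list come from the question's code)
for url in url_list:

    # navigate to a URL
    driver.get(url)

    # check something here to test if a link is 'broken' or not
    try:
        # some_locator is a placeholder you supply, e.g. (By.CLASS_NAME, "post_text");
        # the 10-second limit is an arbitrary example value.
        # WebDriverWait raises TimeoutException if the element never appears.
        WebDriverWait(driver, 10).until(EC.presence_of_element_located(some_locator))

    # if link is broken, go back
    except TimeoutException:
        driver.back()
        # continue so we can return to beginning of loop
        continue

    # if you reach this point, the link is valid, and you can 'do stuff' on the page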

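This code navigates to a URL and performs some check (which you specify) to see whether the link is 'broken'. We detect a broken link by catching the TimeoutException that is thrown. If the exception is thrown, we navigate back to the previous page, then call continue to return to the start of the loop and start over with the next URL.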
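The loop described above can also be wrapped in a small function, which makes the skip-on-timeout behaviour easy to check in isolation. This is only a minimal sketch under some assumptions: `visit_urls` and `page_is_ready` are hypothetical names that do not appear in the question, `driver` is taken to be any object with Selenium-style `get()` and `back()` methods, and the `ImportError` fallback is there only so the sketch can run without Selenium installed.

```python
try:
    from selenium.common.exceptions import TimeoutException
except ImportError:
    # Fallback so the sketch can be run without Selenium installed.
    class TimeoutException(Exception):
        pass


def visit_urls(driver, urls, page_is_ready):
    """Return the URLs whose readiness check did not time out."""
    valid = []
    for url in urls:
        driver.get(url)
        try:
            # page_is_ready is a hypothetical check supplied by the caller,
            # for example an explicit wait for the article body. It is
            # expected to raise TimeoutException when the page is 'broken'.
            page_is_ready(driver)
        except TimeoutException:
            # Broken link: go back and move on to the next URL.
            driver.back()
            continue
        # Only reached when the check passed.
        valid.append(url)
    return valid
```

With a real browser, `page_is_ready` could be something like a `WebDriverWait(...).until(...)` call on an element that every valid article page contains.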

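If we get past the try / except block, the URL is valid and we are on the correct page. Here you can write the code to scrape the article or do whatever else you need.

Code that appears after the try / except is reached only if no TimeoutException was encountered, which means the URL is valid.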
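One thing worth adding for the question's case: when a page load timeout has been set, `driver.get()` itself can raise TimeoutException, so it may help to put the navigation call inside the try block as well. The following is a hedged sketch of that variant, not the original poster's code: `scrape_all`, `scrape_page`, and the 30-second default are assumptions made for illustration, and the `ImportError` fallback only exists so the sketch runs without Selenium installed.

```python
try:
    from selenium.common.exceptions import TimeoutException, WebDriverException
except ImportError:
    # Fallback so the sketch can be run without Selenium installed.
    class WebDriverException(Exception):
        pass

    class TimeoutException(WebDriverException):
        pass


def scrape_all(driver, urls, scrape_page, page_load_timeout=30):
    """Visit every URL and return (results, failed) without stopping on errors."""
    # Make driver.get() raise TimeoutException instead of waiting indefinitely.
    driver.set_page_load_timeout(page_load_timeout)
    results, failed = [], []
    for raw_url in urls:
        url = raw_url.strip()  # readlines() keeps the trailing newline
        if not url:
            continue
        try:
            driver.get(url)
        except TimeoutException:
            failed.append((url, "timeout"))
            continue
        except WebDriverException:
            failed.append((url, "webdriver error"))
            continue
        # Only reached when the page loaded; scrape_page is a hypothetical
        # callback that extracts whatever is needed from the current page.
        results.append((url, scrape_page(driver)))
    return results, failed
```

Recording the failed URLs instead of only printing a message makes it possible to retry them later. TimeoutException is caught before WebDriverException because it is a subclass of it.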
