Crawling through pagination with mechanize in Python

Question

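I am using mechanize with Python to crawl a website and extract data. So far I can submit the form and read the content of the resulting page, but I cannot trigger a click on the "Next Page" link to fetch the following pages. My code is below: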

import re
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://portal.uspto.gov/EmployeeSearch/")
br.select_form(name="SearchEmployeeDataBean")
br.form['name'] = 'a'
response = br.submit()

soup = BeautifulSoup(response)
table = soup.find_all('table')[16]
rows = table.find_all('tr')
data = [[td.findChildren(text=True) for td in tr.findAll("td")] for tr in rows]
for a in data:
    if a:
        examiner = " ".join(a[0][1].split())
        phone = a[1][1]
        extension_office = a[3][1]
        office_description = "|".join(re.findall(r'\d+', a[4][1]))
        # print(examiner, phone, extension_office, office_description)

The results page has a button with the text "Next Page >>". I tried to click it with the following code:

Button HTML:

<a onclick="javascript:goToPage('currentPage', '3')" href="#">Next Page &gt;&gt;</a>

Python code:

req = br.click_link(text_regex='Next Page >>')
r2 = br.open(req)
r2soup = BeautifulSoup(r2)

But it did not work.

How can I click the Next button and fetch the data from each page until there is no next page?

Answer 1

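I found the problem: mechanize does not support JavaScript. The "Next Page >>" link has href="#" and does its work in a JavaScript onclick handler, so when mechanize follows the link after submitting the form, the handler never runs and the pagination is not triggered. I got what I wanted by using Selenium together with Beautiful Soup, with the following Selenium selector: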

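from selenium.webdriver.common.by import By
# find_element_by_link_text was deprecated and later removed in Selenium 4
elem1 = driver.find_element(By.LINK_TEXT, "Next Page >>")
elem1.click()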
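
To make the "until there is no next page" part concrete, here is a minimal sketch rather than a tested solution for this particular site. The names find_next_page, NextPageFinder, crawl_all_pages and handle_page are hypothetical and do not come from mechanize or Selenium. The first helper uses only the standard library to show what the link actually carries: the target page number sits in the onclick attribute, not in href. The second function assumes that driver is an already-created Selenium WebDriver sitting on the first results page, and that the last page simply has no "Next Page >>" link.

```python
import re
from html.parser import HTMLParser

# Matches the handler from the question: goToPage('currentPage', '3')
GO_TO_PAGE = re.compile(r"goToPage\('([^']*)',\s*'(\d+)'\)")


class NextPageFinder(HTMLParser):
    """Records the page number targeted by the "Next Page >>" link, if any."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True, so &gt;&gt; arrives as >>
        self.next_page = None
        self._pending = None  # page number of the <a> currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            match = GO_TO_PAGE.search(dict(attrs).get("onclick") or "")
            self._pending = int(match.group(2)) if match else None
            self._text = []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._pending is not None:
            if "".join(self._text).strip().startswith("Next Page"):
                self.next_page = self._pending
            self._pending = None


def find_next_page(html):
    """Return the next page number, or None when there is no next-page link."""
    finder = NextPageFinder()
    finder.feed(html)
    finder.close()
    return finder.next_page


def crawl_all_pages(driver, handle_page):
    """Call handle_page(html) on every results page, clicking through with Selenium."""
    # Imported lazily so the parsing helper above works without Selenium installed.
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    while True:
        handle_page(driver.page_source)
        try:
            # Assumption: the click reloads the page and the last page has no such link.
            driver.find_element(By.LINK_TEXT, "Next Page >>").click()
        except NoSuchElementException:
            break
```

In use, handle_page would be the Beautiful Soup parsing from the question wrapped in a function, for example crawl_all_pages(driver, lambda html: parse_rows(BeautifulSoup(html))), where parse_rows is again a hypothetical name. If the site updates the table without a full page load, an explicit wait would probably be needed after each click.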
