Can't scrape names from next pages using requests

I'm trying to parse names from a webpage that spans multiple pages, using a python script. With my current attempt I can get the names from its landing page. However, I can't figure out how to fetch the names from the next pages using requests and BeautifulSoup.

website link

My current attempt:

import requests
from bs4 import BeautifulSoup

url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"

with requests.Session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    # Grab every row in the contractors grid that contains a name span
    for elem in soup.select("table#gvContractors tr:has([id*='_lblName'])"):
        name = elem.select_one("span[id*='_lblName']").get_text(strip=True)
        print(name)

I've tried modifying my script to fetch content only from the second page, to make sure it works where the next page button is involved, but unfortunately it still fetches data from the first page:

import requests
from bs4 import BeautifulSoup

url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"

with requests.Session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    # Collect all named inputs (including __VIEWSTATE) as the form payload
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['__EVENTARGUMENT'] = 'Page$Next'
    # Drop the submit buttons so the postback isn't treated as a button click
    payload.pop('btnClose')
    payload.pop('btnMapClose')
    res = s.post(url, data=payload, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Referer': 'https://proximity.niceic.com/mainform.aspx?PostCode=YO95',
        })
    sauce = BeautifulSoup(res.text, "lxml")
    for elem in sauce.select("table#gvContractors tr:has([id*='_lblName'])"):
        name = elem.select_one("span[id*='_lblName']").get_text(strip=True)
        print(name)

Navigation to the next page is performed via a POST request carrying the __VIEWSTATE cursor.

How to handle it with requests:

  1. Make a GET request to the first page;

  2. Parse the required data and the __VIEWSTATE cursor;

  3. Prepare a POST request for the next page using the received cursor;

  4. Fire it, then parse all the data and the new cursor from the next page.

I won't provide any code, because it would amount to writing almost the whole crawler.

==== ADDED ====

You were almost there, but you missed two important things.

  1. You need to send headers with the first GET request. If no headers are sent, we get corrupted tokens (easy to spot visually: they don't end with ==; see the quick check sketched after this list).

  2. We need to add __ASYNCPOST to the payload we send. (Interesting detail: it's not the boolean True but the string 'true'.)
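
A quick sanity check for point 1 (just a sketch, leaning on the '==' observation above; payload is the dict of hidden inputs you already parse):

# Sketch: a healthy token is base64-encoded, so it normally ends with '='
# padding; a truncated token (from a GET sent without headers) won't.
viewstate = payload['__VIEWSTATE']
if not viewstate.endswith('='):
    print("Suspicious __VIEWSTATE - did the first GET include headers?")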

Here's the code. I dropped bs4 and switched to lxml (I don't like bs4; it's slow). We know exactly which fields we need to send, so let's parse just those few inputs.

import re
import requests
from lxml import etree


def get_nextpage_tokens(response_body):
    """ Parse tokens from XMLHttpRequest response for making next request to next page and create payload """
    try:
        payload = dict()
        payload['ToolkitScriptManager1'] = 'UpdatePanel1|gvContractors'
        payload['__EVENTTARGET'] = 'gvContractors'
        payload['__EVENTARGUMENT'] = 'Page$Next'
        payload['__VIEWSTATEENCRYPTED'] = ''
        payload['__VIEWSTATE'] = re.search(r'__VIEWSTATE\|([^\|]+)', response_body).group(1)
        payload['__VIEWSTATEGENERATOR'] = re.search(r'__VIEWSTATEGENERATOR\|([^\|]+)', response_body).group(1)
        payload['__EVENTVALIDATION'] = re.search(r'__EVENTVALIDATION\|([^\|]+)', response_body).group(1)
        payload['__ASYNCPOST'] = 'true'
        return payload
    except AttributeError:
        # re.search found no token - probably the last page
        return None


if __name__ == '__main__':
    url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"

    headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Referer': 'https://proximity.niceic.com/mainform.aspx?PostCode=YO95',
            }

    with requests.Session() as s:
        page_num = 1
        r = s.get(url, headers=headers)
        parser = etree.HTMLParser()
        tree = etree.fromstring(r.text, parser)

        # Creating payload
        payload = dict()
        payload['ToolkitScriptManager1'] = 'UpdatePanel1|gvContractors'
        payload['__EVENTTARGET'] = 'gvContractors'
        payload['__EVENTARGUMENT'] = 'Page$Next'
        payload['__VIEWSTATE'] = tree.xpath("//input[@name='__VIEWSTATE']/@value")[0]
        payload['__VIEWSTATEENCRYPTED'] = ''
        payload['__VIEWSTATEGENERATOR'] = tree.xpath("//input[@name='__VIEWSTATEGENERATOR']/@value")[0]
        payload['__EVENTVALIDATION'] = tree.xpath("//input[@name='__EVENTVALIDATION']/@value")[0]
        payload['__ASYNCPOST'] = 'true'
        # Subsequent POSTs must look like async (XMLHttpRequest) calls
        headers['X-Requested-With'] = 'XMLHttpRequest'

        while True:
            page_num += 1
            res = s.post(url, data=payload, headers=headers)

            print(f'page {page_num} data: {res.text}')  # FIXME: Parse data

            payload = get_nextpage_tokens(res.text)  # Creating payload for next page
            if not payload:
                # No fresh tokens - probably the last page (worth verifying)
                break

Important

The response is not well-formed HTML, so you'll have to deal with that: cut out the table or something similar. Good luck!
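
For example, here is a minimal sketch of that post-processing (an assumption on my part: the delta response embeds the updated panel's HTML, including the gvContractors table, verbatim in its pipe-delimited body; parse_names is a hypothetical helper, not part of the code above):

import re
from lxml import etree


def parse_names(response_body):
    """ Pull contractor names out of the async response body """
    # Cut the grid table out of the not-quite-HTML response
    match = re.search(r'<table[^>]*gvContractors.*?</table>', response_body, re.S)
    if not match:
        return []
    tree = etree.fromstring(match.group(0), etree.HTMLParser())
    # Name spans keep their '_lblName' id suffix, same as on the landing page
    return [text.strip() for text in
            tree.xpath("//span[contains(@id, '_lblName')]/text()")
            if text.strip()]

With that, the FIXME line in the loop could become print(parse_names(res.text)).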