Creating POST request to scrape website with python where no network form data changes

I am scraping a website that is dynamically rendered with JavaScript. The URL does not change when the > button is clicked, so I have been digging through the Network tab of the inspector, specifically the "Request URL", "Request Method", and "Form Data" sections, looking for any kind of ID that would be unique to each successive page. However, when I log the requests while clicking the > button page by page, the "Form Data" appears to be identical every time (see image):

My code does not currently include this approach, because I don't see how it would help until I can find a unique identifier in the "Form Data" section. I can show my code if that helps, though. Essentially, it just pulls the first page's data over and over inside my while loop, even though I am using a Selenium driver and calling driver.find_elements_by_xpath("xpath of > button").click() before trying to grab the data with BeautifulSoup.
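For reference, this is roughly what I was hoping to do: replay the table's request directly with the requests library, assuming the "Form Data" contained some page-specific field. The field name below is hypothetical, which is exactly the problem here, since the real form data never changes between pages:

import requests

url = 'https://cryptoli.st/lists/fixed-supply'  # the "Request URL" from the Network tab

# Hypothetical form field -- on this site the recorded "Form Data" is
# identical for every page, so there is nothing like this to vary.
form_data = {
    'page': 2,  # imaginary page-specific identifier
}

response = requests.post(url, data=form_data)
print(response.text)  # would contain page 2's rows if the server paged this way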

(see the comments for updated code)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import pandas as pd
masters_list = []


def extract_info(html_source):
    # html_source is the inner HTML of the data table
    soup = BeautifulSoup(html_source, 'html.parser')
    lst = soup.find('tbody').find_all('tr')[0]  # note: only the first <tr> of the current page
    masters_list.append(lst)


chrome_driver_path = '/Users/Justin/Desktop/Python/chromedriver'
driver = webdriver.Chrome(service=Service(chrome_driver_path))
url = 'https://cryptoli.st/lists/fixed-supply'
driver.get(url)
loop = True

while loop:  # loop over all 120 pages
    crypto_table = driver.find_element(By.ID, 'DataTables_Table_0').get_attribute(
        'innerHTML')  # inner HTML of the crypto data table

    extract_info(crypto_table)

    paginate = driver.find_element(
        By.ID, "DataTables_Table_0_paginate")  # the table's pagination controls
    pages_list = paginate.find_elements(By.TAG_NAME, 'li')
    # click the next arrow (the last <li>), not the 2, 3, ... anchor links
    next_page_link = pages_list[-1].find_element(By.TAG_NAME, 'a')

    # check whether a next page is available
    if "disabled" in next_page_link.get_attribute('class'):
        loop = False

    pages_list[-1].click()  # click it if a next page is available
df = pd.DataFrame(masters_list)
print(df)
df.to_csv("crypto_list.csv")
driver.quit()

I am using my own code to show how I got the table; I have added explanations as comments on the important lines.

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

def extract_info(html_source):
    soup = BeautifulSoup(html_source,'html.parser') # html_source will be the inner HTML of the table
    lst = soup.find('tbody').find_all('tr')
    for i in lst:
        print(i.get('id')) # printing just the id, since each row's id is set to the crypto name; you have to do more scraping to get more info



driver = webdriver.Chrome()
url = 'https://cryptoli.st/lists/fixed-supply'
driver.get(url)
loop = True

while loop: # loop over all 120 pages
    crypto_table = driver.find_element(By.ID,'DataTables_Table_0').get_attribute('innerHTML') # inner HTML of the crypto data table

    extract_info(crypto_table) # prints the id of every row on the current page

    paginate = driver.find_element(By.ID, "DataTables_Table_0_paginate") # the table's pagination controls
    pages_list = paginate.find_elements(By.TAG_NAME,'li')
    next_page_link = pages_list[-1].find_element(By.TAG_NAME,'a') # click the next arrow (the last <li>), not the 2, 3, ... anchor links

    if "disabled" in next_page_link.get_attribute('class'): # checking is there next page available 
        loop = False

    pages_list[-1].click() # click it if a next page is available

So the main answer to your question is that when you click the button, Selenium updates the page, and you can then use driver.page_source to get the updated HTML. Sometimes (not for this URL) a page fires AJAX requests that can take some time, so you have to wait until Selenium has loaded the entire page.
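If you do hit that case, an explicit wait is safer than scraping immediately after the click. Here is a minimal sketch, assuming the table's rows are re-rendered when the page changes, so a row grabbed before the click goes stale afterwards (DataTables_Table_0_next is the id DataTables gives the next button by default; adjust it if the site differs):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://cryptoli.st/lists/fixed-supply')

first_row = driver.find_element(By.CSS_SELECTOR, '#DataTables_Table_0 tbody tr')
driver.find_element(By.ID, 'DataTables_Table_0_next').click()  # DataTables' default next-button id (assumption)
WebDriverWait(driver, 10).until(EC.staleness_of(first_row))  # block until the old rows are replaced
html = driver.page_source  # now reflects the new page
driver.quit()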