How to call data in batches from a select statement and append into a dataframe?

I have a file containing a SQL statement that I'm reading into Python using pyodbc. The SQL statement is just a select statement, like the following:

select distinct (columns) from table1

However, the data I'm calling is 30 million rows.

I can do this for smaller tables and put the information into a dataframe.

Is it possible to batch the select statement so it only pulls X rows at a time, appends them to a dataframe, and keeps doing this until the end of the 30 million records?

Code so far:

import os.path
import pandas as pd
import tinys3
import psycopg2
import pyodbc
from datetime import datetime
import uuid
import glob
from os import listdir
from os.path import isfile, join
import time

startTime = datetime.now()

#reading in data for db
server = 'xxxx' 
database = 'xxx' 
username = 'xxx' 
password = 'xxxx' 
driver= '{ODBC Driver 17 for SQL Server}'
cnxn = pyodbc.connect('DRIVER='+driver+';SERVER='+server+';PORT=xxx;DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
path = "path/to/folder"

for infile in glob.glob(os.path.join(path, '*.sql')):
    # read the SQL statement from the file
    with open(infile, 'r') as myfile:
        sql = myfile.read()
    print(sql)

    cursor.execute(sql)

    # fetch every row at once, then build the dataframe
    row = cursor.fetchall()
    columns = [column[0].lower() for column in cursor.description]

    df = pd.DataFrame([tuple(t) for t in row], columns=columns)

You can use the fetchmany function:

cursor.fetchmany([size=cursor.arraysize]) --> list

Returns a list of remaining rows, containing no more than size rows, used to process results in chunks. The list will be empty when there are no more rows.

The default for cursor.arraysize is 1 which is no different than calling fetchone().

A ProgrammingError exception is raised if no SQL has been executed or if it did not return a result set (e.g. was not a SELECT statement).

This will allow you to extract the data in chunks.

Usage example:

while True:
    three_rows = cursor.fetchmany(3)
    # every loop cycle, 3 rows are fetched
    if not three_rows:
        break
    print(three_rows)
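Applied to your code, a minimal sketch of building the dataframe in batches might look like this (the chunk size of 100,000 is an assumption; tune it to your memory budget):

chunk_size = 100_000  # assumed batch size -- adjust as needed
columns = [column[0].lower() for column in cursor.description]

chunks = []
while True:
    rows = cursor.fetchmany(chunk_size)
    if not rows:  # empty list means no more rows
        break
    chunks.append(pd.DataFrame([tuple(t) for t in rows], columns=columns))

# combine all batches into a single dataframe
df = pd.concat(chunks, ignore_index=True)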

You can also use the fetchone function to process the data row by row.

fetchone

cursor.fetchone() --> Row or None

Returns the next row or None when no more data is available.
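For example, a row-by-row version of the same loop (just a sketch):

while True:
    row = cursor.fetchone()
    if row is None:  # no more data
        break
    print(row)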