Excessive memory usage while getting data from a Postgres database
I have been using Python to fetch data from a Postgres database, and it is using a lot of memory, as can be seen below. The function below is the only function I run, and it uses an excessive amount of memory. I am using fetchmany() and fetching the data in small chunks. I have also tried iterating over the cur cursor directly. However, all of these methods end up using excessive amounts of memory. Does anyone have any clue why this happens? Is there anything I need to tune on the Postgres end that could help mitigate this problem?
import logging

import pandas as pd
import psycopg2
from tqdm import tqdm

def checkMultipleLine(dbName):
    '''
    Checks for rows that contain data spanning multiple lines.
    This is the most basic of checks. If a particular row has
    data that spans multiple lines, then that particular row
    is corrupt. For dealing with these rows we must first find
    out whether there are places in the database that contain
    data spanning multiple lines.
    '''
    logger = logging.getLogger('mindLinc.checkSchema.checkMultipleLines')
    logger.info('Finding rows that span multiple lines')
    schema = findTables(dbName)
    results = []
    for t in tqdm(sorted(schema.keys())):
        conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
        cur = conn.cursor()
        cur.execute('select * from %s' % t)
        n = 0  # rows with a newline embedded in a text column
        N = 0  # total rows scanned
        while True:
            css = cur.fetchmany(1000)
            if not css:
                break
            for cs in css:
                N += 1
                if any('\n' in c for c in cs if isinstance(c, str)):
                    n += 1
        cur.close()
        conn.close()
        tqdm.write('[%40s] -> [%5d][%10d][%.4e]' % (t, n, N, n / (N + 1.0)))
        results.append({
            'tableName': t,
            'totalRows': N,
            'badRows':   n,
        })
    logger.info('Finished checking for multiple lines')
    results = pd.DataFrame(results)[['tableName', 'badRows', 'totalRows']]
    print(results)
    results.to_csv('error_MultipleLine[%s].csv' % dbName, index=False)
    return results
Psycopg2 supports server-side cursors for large queries, as described in the psycopg2 documentation. The reason your chunked fetching does not help is that with a regular (client-side) cursor, execute() transfers the entire result set to the client before fetchmany() ever slices it; a named cursor keeps the result set on the server and streams it over in batches. Here is how to use one together with the client-side buffer setting:

cur = conn.cursor('cursor-name')
cur.itersize = 10000  # rows fetched from the server per network round trip

This should reduce the memory footprint.
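Applied to the per-table loop from the question, a minimal sketch might look like the following. The cursor name 'multiline_check' is arbitrary (any name makes the cursor server-side), and dbName and t are the variables from the original function:

# Same scan with a named (server-side) cursor: the result set stays on
# the server and is streamed to the client itersize rows at a time,
# instead of being materialized in client memory by execute().
conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
cur = conn.cursor('multiline_check')  # naming the cursor makes it server-side
cur.itersize = 10000                  # rows fetched per network round trip
cur.execute('select * from %s' % t)

n = N = 0
for cs in cur:  # iteration pulls batches of itersize rows behind the scenes
    N += 1
    if any('\n' in c for c in cs if isinstance(c, str)):
        n += 1

cur.close()
conn.close()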