Excessive memory usage while getting data from a Postgres database
I have been using Python to fetch data from a Postgres database, and it is using a lot of memory, as can be seen below. The function below is the only function I run, and it uses an excessive amount of memory. I am using fetchmany() and fetching the data in small chunks. I have also tried iterating over the cur cursor directly. However, all of these methods end up using excessive amounts of memory. Does anyone have any clue why this happens? Is there anything I need to tune on the Postgres end that could help mitigate this problem?
import logging

import pandas as pd
import psycopg2
from tqdm import tqdm

def checkMultipleLine(dbName):
    '''
    Checks for rows that contain data spanning multiple lines.
    This is the most basic of checks. If a particular row has
    data that spans multiple lines, then that particular row
    is corrupt. For dealing with these rows we must first find
    out whether there are places in the database that contain
    data spanning multiple lines.
    '''
    logger = logging.getLogger('mindLinc.checkSchema.checkMultipleLines')
    logger.info('Finding rows that span multiple lines')
    schema = findTables(dbName)
    results = []
    for t in tqdm(sorted(schema.keys())):
        conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
        cur = conn.cursor()
        cur.execute('select * from %s' % t)
        n = 0  # rows with a newline embedded in a text column
        N = 0  # total rows scanned
        while True:
            css = cur.fetchmany(1000)
            if not css:
                break
            for cs in css:
                N += 1
                if any('\n' in c for c in cs if isinstance(c, str)):
                    n += 1
        cur.close()
        conn.close()
        tqdm.write('[%40s] -> [%5d][%10d][%.4e]' % (t, n, N, n / (N + 1.0)))
        results.append({
            'tableName': t,
            'totalRows': N,
            'badRows':   n,
        })
    logger.info('Finished checking for multiple lines')
    results = pd.DataFrame(results)[['tableName', 'badRows', 'totalRows']]
    print(results)
    results.to_csv('error_MultipleLine[%s].csv' % dbName, index=False)
    return results
Psycopg2 supports server-side cursors for large queries, as described in the psycopg2 documentation. The reason your chunked fetching does not help is that with a regular (client-side) cursor, execute() transfers the entire result set to the client before fetchmany() ever slices it; a named cursor keeps the result set on the server and streams it over in batches. Here is how to use one together with the client-side buffer setting:

cur = conn.cursor('cursor-name')
cur.itersize = 10000  # rows fetched from the server per network round trip

This should reduce the memory footprint.
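Applied to the per-table loop from the question, a minimal sketch might look like the following. The cursor name 'multiline_check' is arbitrary (any name makes the cursor server-side), and dbName and t are the variables from the original function:

# Same scan with a named (server-side) cursor: the result set stays on
# the server and is streamed to the client itersize rows at a time,
# instead of being materialized in client memory by execute().
conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
cur = conn.cursor('multiline_check')  # naming the cursor makes it server-side
cur.itersize = 10000                  # rows fetched per network round trip
cur.execute('select * from %s' % t)

n = N = 0
for cs in cur:  # iteration pulls batches of itersize rows behind the scenes
    N += 1
    if any('\n' in c for c in cs if isinstance(c, str)):
        n += 1

cur.close()
conn.close()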