Memory-efficient Python (pandas) aggregates of categories from one CSV file per period
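I'm trying to avoid segmentation faults with pandas or IOPro (still investigating), so I'm looking for alternative solutions, especially more efficient ones. The code below works fine with small data, but crashes when reading 90 monthly panels of a few GB each on a Linux server with 256 GB RAM, using pandas 0.16.2 np19py26_0, iopro 1.7.1 np19py27_p0 and python 2.7.10 0.
What I'm doing here is aggregating purchase records of pharmaceuticals (with the cost in TKOST) per individual (LopNr) and per month, while also grouping the drugs into categories by their ATC codes.
So the raw data in each monthly csv file look like this (here, say, July 2006; the csv has many other columns that I don't need):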
LopNr TKOST ATC
1 5 N01
1 11 N01
1 6 N15
etc.
I want aggregated panels with rows like
LopNr TKOST year month
1 22 2006 7
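either separately for several categories (e.g. neuro, here the ATC codes starting with N), or in a single data file with a separate summary for each category (so with a neuro column, etc.).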
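To make the target concrete, here is a minimal plain-Python sketch of that aggregation on the three sample rows above. The names `rows` and `totals` are only illustrative, and the year and month are assumed to come from the file name rather than from a column.

```python
from collections import defaultdict

# The three sample rows from the July 2006 file: (LopNr, TKOST, ATC).
rows = [(1, 5, 'N01'), (1, 11, 'N01'), (1, 6, 'N15')]
year, month = 2006, 7  # assumed to be taken from the file name

# Sum TKOST per (LopNr, year, month) for ATC codes starting with 'N'.
totals = defaultdict(int)
for lopnr, tkost, atc in rows:
    if atc.startswith('N'):
        totals[(lopnr, year, month)] += tkost

print(dict(totals))  # {(1, 2006, 7): 22}
```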
I chose IOPro over plain pandas for memory efficiency, but now I'm getting segmentation faults.
# -*- coding: utf-8 -*-
import iopro
from pandas import *
neuro = DataFrame()
cardio = DataFrame()
cancer = DataFrame()
addiction = DataFrame()
Adrugs = DataFrame()
Mdrugs = DataFrame()
Vdrugs = DataFrame()
all_drugs = DataFrame()
for year in xrange(2005,2013):
    for month in xrange(1,13):
        if year == 2005 and month < 7:
            continue
        filename = 'PATH/lmed_' + str(year) + '_mon'+ str(month) +'.txt'
        adapter = iopro.text_adapter(filename,parser='csv',field_names=True,output='dataframe',delimiter='\t')
        monthly = adapter[['LopNr','ATC','TKOST']][:]
        monthly['year']=year
        monthly['month']=month
        neuro = neuro.append(monthly[(monthly.ATC.str.startswith('N')) & (~(monthly.TKOST.isnull()))])
        cardio = cardio.append(monthly[(monthly.ATC.str.startswith('C')) & (~(monthly.TKOST.isnull()))])
        cancer = cancer.append(monthly[(monthly.ATC.str.startswith('L')) & (~(monthly.TKOST.isnull()))])
        addiction = addiction.append(monthly[(monthly.ATC.str.startswith('N07')) & (~(monthly.TKOST.isnull()))])
        Adrugs = Adrugs.append(monthly[(monthly.ATC.str.startswith('A')) & (~(monthly.TKOST.isnull()))])
        Mdrugs = Mdrugs.append(monthly[(monthly.ATC.str.startswith('M')) & (~(monthly.TKOST.isnull()))])
        Vdrugs = Vdrugs.append(monthly[(monthly.ATC.str.startswith('V')) & (~(monthly.TKOST.isnull()))])
        all_drugs = all_drugs.append(monthly[(~(monthly.TKOST.isnull()))])
        del monthly
all_drugs = all_drugs.groupby(['LopNr','year','month']).sum()
all_drugs = all_drugs.astype(int,copy=False)
all_drugs.to_csv('PATH/monthly_all_drugs_costs.csv')
del all_drugs
neuro = neuro.groupby(['LopNr','year','month']).sum()
neuro = neuro.astype(int,copy=False)
neuro.to_csv('PATH/monthly_neuro_costs.csv')
del neuro
cardio = cardio.groupby(['LopNr','year','month']).sum()
cardio = cardio.astype(int,copy=False)
cardio.to_csv('PATH/monthly_cardio_costs.csv')
del cardio
cancer = cancer.groupby(['LopNr','year','month']).sum()
cancer = cancer.astype(int,copy=False)
cancer.to_csv('PATH/monthly_cancer_costs.csv')
del cancer
addiction = addiction.groupby(['LopNr','year','month']).sum()
addiction = addiction.astype(int,copy=False)
addiction.to_csv('PATH/monthly_addiction_costs.csv')
del addiction
Adrugs = Adrugs.groupby(['LopNr','year','month']).sum()
Adrugs = Adrugs.astype(int,copy=False)
Adrugs.to_csv('PATH/monthly_Adrugs_costs.csv')
del Adrugs
Mdrugs = Mdrugs.groupby(['LopNr','year','month']).sum()
Mdrugs = Mdrugs.astype(int,copy=False)
Mdrugs.to_csv('PATH/monthly_Mdrugs_costs.csv')
del Mdrugs
Vdrugs = Vdrugs.groupby(['LopNr','year','month']).sum()
Vdrugs = Vdrugs.astype(int,copy=False)
Vdrugs.to_csv('PATH/monthly_Vdrugs_costs.csv')
del Vdrugs
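Your code is very repetitive and can be simplified with dictionary and list comprehensions. This solution should eliminate your memory problems, because you only process one month of data at a time (you do keep a growing list of monthly summaries, but I don't think that will take much memory).
I'm not able to test this, but I believe it does everything the code above does.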
import pandas as pd
import iopro

# Category name -> ATC prefix; the empty prefix matches every code.
items = {'neuro': 'N',
         'cardio': 'C',
         'cancer': 'L',
         'addiction': 'N07',
         'Adrugs': 'A',
         'Mdrugs': 'M',
         'Vdrugs': 'V',
         'all_drugs': ''}

# 1. Create data container using dictionary comprehension.
monthly_summaries = {item: list() for item in items.keys()}

# 2. Perform monthly groupby operations.
for year in xrange(2005, 2013):
    for month in xrange(1, 13):
        if year == 2005 and month < 7:
            continue
        filename = 'PATH/lmed_' + str(year) + '_mon'+ str(month) +'.txt'
        adapter = iopro.text_adapter(filename,
                                     parser='csv',
                                     field_names=True,
                                     output='dataframe',
                                     delimiter='\t')
        monthly = adapter[['LopNr','ATC','TKOST']][:]
        monthly['year'] = year
        monthly['month'] = month
        # One filtered frame per category, dropping rows without a cost.
        dfs = {name: monthly[(monthly.ATC.str.startswith(code))
                             & (~(monthly.TKOST.isnull()))]
               for name, code in items.iteritems()}
        # Keep only the small per-month aggregate for each category.
        [monthly_summaries[name].append(dfs[name].groupby(['LopNr','year','month']).sum()
                                        .astype(int, copy=False))
         for name in items.keys()]

# 3. Now concatenate all of the monthly summaries into separate DataFrames.
#    The (LopNr, year, month) index is kept because step 4 resets it.
dfs = {name: pd.concat(monthly_summaries[name])
       for name in items.keys()}

# 4. Now regroup the aggregate monthly summaries.
monthly_summaries = {name: dfs[name].reset_index().groupby(['LopNr','year','month']).sum()
                     for name in items.keys()}

# 5. Finally, save the aggregated results to files.
[monthly_summaries[name].to_csv('PATH/monthly_{0}_costs.csv'.format(name))
 for name in items.keys()]
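If the segmentation faults persist, one alternative that avoids building large DataFrames altogether is to stream each file row by row with the standard-library csv module and keep only running totals. The sketch below is a different technique from the code above and has not been run on the real data. It is written for Python 3, and the helper name `aggregate_month`, the handling of an empty TKOST field as a missing cost, and the assumption that costs are whole numbers are all mine. As in the `items` dictionary, rows with an `'N07'` code also count toward `'N'`, and the empty prefix matches every ATC code.

```python
import csv
import io
from collections import defaultdict

# Category name -> ATC prefix; '' matches every code (all drugs).
ITEMS = {'neuro': 'N', 'cardio': 'C', 'cancer': 'L', 'addiction': 'N07',
         'Adrugs': 'A', 'Mdrugs': 'M', 'Vdrugs': 'V', 'all_drugs': ''}


def aggregate_month(lines, year, month, totals):
    """Add one month's tab-separated rows to the running totals.

    `lines` is any iterable of text lines with a header row (hypothetical
    helper; pass an open file object for real data).
    """
    for row in csv.DictReader(lines, delimiter='\t'):
        cost = (row.get('TKOST') or '').strip()
        if not cost:
            continue  # assumption: an empty field means a missing cost
        atc = row.get('ATC') or ''
        for name, prefix in ITEMS.items():
            if atc.startswith(prefix):
                # assumption: costs are whole numbers, so int() loses nothing
                totals[name][(row['LopNr'], year, month)] += int(float(cost))
    return totals


# Tiny in-memory stand-in for one monthly file.
sample = io.StringIO(
    u"LopNr\tTKOST\tATC\n1\t5\tN01\n1\t11\tN07\n1\t6\tC10\n2\t\tN01\n")
totals = defaultdict(lambda: defaultdict(int))
aggregate_month(sample, 2006, 7, totals)
print(totals['neuro'][('1', 2006, 7)])      # 16
print(totals['addiction'][('1', 2006, 7)])  # 11
print(totals['all_drugs'][('1', 2006, 7)])  # 22
```

Memory use here grows with the number of distinct (LopNr, year, month) keys per category rather than with the number of raw rows, at the price of a slower pure-Python loop.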