Append a DataFrame to an index in pandas
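My data has a lot of nesting. I have 6 time periods (but let's not worry about those), each time period has 19 quantiles, and each quantile has a 51x51 covariance matrix (for all US states plus the District of Columbia). Expressed as a dictionary, I would have: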
# data_for_0_05, data_for_0_10, ... are placeholders for the per-quantile input data
my_data = {
    'time_pd_1': {
        0.05: pd.DataFrame(data=cov_var(data_for_0_05), columns=states, index=states),
        0.10: pd.DataFrame(data=cov_var(data_for_0_10), columns=states, index=states),
        # ...
        0.90: pd.DataFrame(data=cov_var(data_for_0_90), columns=states, index=states),
        0.95: pd.DataFrame(data=cov_var(data_for_0_95), columns=states, index=states),
    },
    'time_pd_2': {
        0.05: pd.DataFrame(data=cov_var(data_for_0_05), columns=states, index=states),
        0.10: pd.DataFrame(data=cov_var(data_for_0_10), columns=states, index=states),
        # ...
        0.90: pd.DataFrame(data=cov_var(data_for_0_90), columns=states, index=states),
        0.95: pd.DataFrame(data=cov_var(data_for_0_95), columns=states, index=states),
    },
    # ...
    'time_pd_6': {
        0.05: pd.DataFrame(data=cov_var(data_for_0_05), columns=states, index=states),
        0.10: pd.DataFrame(data=cov_var(data_for_0_10), columns=states, index=states),
        # ...
        0.90: pd.DataFrame(data=cov_var(data_for_0_90), columns=states, index=states),
        0.95: pd.DataFrame(data=cov_var(data_for_0_95), columns=states, index=states),
    },
}
Simple enough, but that is not how the data is created. I have two for loops that do the job:
for tpd in time_periods:
    for q in quantiles:
        # data_for_q stands for the input data of quantile q in period tpd
        tdf = pd.DataFrame(data=cov_var(data_for_q), index=states, columns=states)
If I print tdf, it looks like this:
ST Alabama Alaska Arizona ... West Virginia Wisconsin Wyoming
ST
Alabama 288.867628 50.000000 -100.062576 ... 37.719317 0 -75.000000
Alaska 50.000000 280.929272 -229.365427 ... 57.514555 0 -136.365512
Arizona -100.062576 -229.365427 946.563177 ... -113.805612 0 291.897723
... ... ... ... ... ... ... ...
West Virginia 37.719317 57.514555 -113.805612 ... 342.195976 0 -214.243277
Wisconsin 0.000000 0.000000 0.000000 ... 0.000000 0 0.000000
Wyoming -75.000000 -136.365512 291.897723 ... -214.243277 0 684.146619
Now, what I want is:
cov = {}
for tpd in time_periods:
    cov[tpd] = pd.DataFrame(index=[str(round(q,2)) for q in quantiles])
    for q in quantiles:
        tdf = pd.DataFrame(data=cov_var(data_for_q), index=states, columns=states)
        cov[tpd].loc[str(round(q,2)), :] = tdf
So if I print cov[tpd], it should look like:
ST Alabama Alaska Arizona ... West Virginia Wisconsin Wyoming
q ST
Alabama 288.867628 50.000000 -100.062576 ... 37.719317 0 -75.000000
Alaska 50.000000 280.929272 -229.365427 ... 57.514555 0 -136.365512
Arizona -100.062576 -229.365427 946.563177 ... -113.805612 0 291.897723
0.05 ... ... ... ... ... ... ... ...
West Virginia 37.719317 57.514555 -113.805612 ... 342.195976 0 -214.243277
Wisconsin 0.000000 0.000000 0.000000 ... 0.000000 0 0.000000
Wyoming -75.000000 -136.365512 291.897723 ... -214.243277 0 684.146619
Alabama 288.867628 50.000000 -100.062576 ... 37.719317 0 -75.000000
Alaska 50.000000 280.929272 -229.365427 ... 57.514555 0 -136.365512
Arizona -100.062576 -229.365427 946.563177 ... -113.805612 0 291.897723
0.10 ... ... ... ... ... ... ... ...
West Virginia 37.719317 57.514555 -113.805612 ... 342.195976 0 -214.243277
Wisconsin 0.000000 0.000000 0.000000 ... 0.000000 0 0.000000
Wyoming -75.000000 -136.365512 291.897723 ... -214.243277 0 684.146619
... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ...
Alabama 288.867628 50.000000 -100.062576 ... 37.719317 0 -75.000000
Alaska 50.000000 280.929272 -229.365427 ... 57.514555 0 -136.365512
Arizona -100.062576 -229.365427 946.563177 ... -113.805612 0 291.897723
0.90 ... ... ... ... ... ... ... ...
West Virginia 37.719317 57.514555 -113.805612 ... 342.195976 0 -214.243277
Wisconsin 0.000000 0.000000 0.000000 ... 0.000000 0 0.000000
Wyoming -75.000000 -136.365512 291.897723 ... -214.243277 0 684.146619
Alabama 288.867628 50.000000 -100.062576 ... 37.719317 0 -75.000000
Alaska 50.000000 280.929272 -229.365427 ... 57.514555 0 -136.365512
Arizona -100.062576 -229.365427 946.563177 ... -113.805612 0 291.897723
0.95 ... ... ... ... ... ... ... ...
West Virginia 37.719317 57.514555 -113.805612 ... 342.195976 0 -214.243277
Wisconsin 0.000000 0.000000 0.000000 ... 0.000000 0 0.000000
Wyoming -75.000000 -136.365512 291.897723 ... -214.243277 0 684.146619
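Having this final structure would make my life so much easier that I would gladly buy a beer for whoever gets me there. That aside, I have tried various approaches: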
cov[tpd].loc[str(round(q,2)), :] = tdf # Raises ValueError: Incompatible indexer with DataFrame
cov[tpd].loc[str(round(q,2)), :].append(tdf) # Almost gives me the frame I need, but removes the index level q, and inserts a column 0 with NaNs
cov[tpd].loc[str(round(q,2)), :].join(tdf, how='outer') # Raises AttributeError: 'Series' object has no attribute 'join'
pd.merge(cov[tpd].loc[str(round(q,2)), :], tdf, how='outer') # Raises AttributeError: 'Series' object has no attribute 'columns'
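I understand all the error messages, and I also have a potential fix: create the DataFrame cov[tpd] up front in the shape I want, then use indexing to insert the output of cov_var(). But that means a few extra lines of code to build the MultiIndex for cov[tpd] and then insert the data. Does anyone know a better way?
Note: cov_var() is a simple covariance function I wrote myself; my situation is somewhat special, so I cannot use a built-in function such as np.cov().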
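So I finally gave in and used the approach I hinted at in the question above. It actually seems to be faster than the approaches I kept trying. All is well. Here is what I ended up doing: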
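cov = {}
# Outer index level: quantile labels as strings; inner level: state names
ind_lev_1 = [str(round(q,2)) for q in quantiles]
ind_lev_2 = states
index = pd.MultiIndex.from_product([ind_lev_1, ind_lev_2], names=['QUANTILE', 'STATE'])
columns = pd.Index(ind_lev_2, name='STATE')
for tpd in time_periods:
    # Pre-build the empty (quantile, state) x state frame for this period
    cov[tpd] = pd.DataFrame(index=index, columns=columns)
    for q in quantiles:
        q = str(round(q,2))
        # Fill the whole block of rows belonging to quantile q
        cov[tpd].loc[(q,), :] = cov_var(arr=data_for_q, means=pop_means_for_q)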
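For comparison, a possible alternative to pre-building the MultiIndex is to collect the per-quantile frames in a list and stack them with pd.concat, whose keys argument adds the outer index level. This is only a sketch, not what I actually ran: the three-state list, the shortened quantile list, and the seed-based cov_var() below are hypothetical stand-ins so that the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the names used in the post.
states = ["Alabama", "Alaska", "Arizona"]
quantiles = [0.05, 0.10, 0.90, 0.95]
time_periods = ["time_pd_1", "time_pd_2"]

def cov_var(seed):
    """Placeholder for the real cov_var(): returns a symmetric matrix."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(len(states), len(states)))
    return a @ a.T

cov = {}
for t, tpd in enumerate(time_periods):
    labels = []
    frames = []
    for i, q in enumerate(quantiles):
        labels.append(str(round(q, 2)))  # note: 0.10 becomes '0.1'
        frames.append(
            pd.DataFrame(cov_var(seed=10 * t + i), index=states, columns=states)
        )
    # keys= supplies the outer index level, names= labels both levels
    cov[tpd] = pd.concat(frames, keys=labels, names=["QUANTILE", "STATE"])
    cov[tpd].columns.name = "STATE"

# Selecting on the outer level returns one quantile's covariance matrix
block = cov["time_pd_1"].loc["0.05"]
```

Whether this is faster than filling a pre-built frame would need to be measured on the real data; the main difference is that the result keeps a numeric dtype instead of starting from an empty object-typed frame.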