Python's `.loc` is really slow on selecting subsets of Data

Question

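I have a large, single-valued DataFrame df with a (y,t) MultiIndex. Currently, I select a subset via df.loc[(Y,T), :] and build a dictionary from it. The MWE below works, but the selection is very slow for large subsets.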

import numpy as np
import pandas as pd

# Full DataFrame
y_max = 50  
Y_max = range(1, y_max+1)

t_max = 100 
T_max = range(1, t_max+1)

idx_max = tuple((y,t) for y in Y_max for t in T_max) 

df = pd.DataFrame(np.random.sample(y_max*t_max), index=idx_max, columns=['Value'])


# Create Dictionary of Subset of Data
y1 = 4
yN = 10
Y = range(y1, yN+1)

t1 = 5
tN = 9
T = range(t1, tN+1)

idx_sub = tuple((y,t) for y in Y for t in T)

data_sub = df.loc[(Y,T), :]  #This is really slow

dict_sub = dict(zip(idx_sub, data_sub['Value']))

# result, e.g. (y,t) = (5,7)
dict_sub[5,7] == df.loc[(5,7), 'Value']

I considered using df.loc[(y1,t1):(yN,tN), :], but it does not work properly, because the second index level is only bounded in the final year yN.
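To illustrate the problem, here is a minimal sketch. It assumes the expression above is meant as a slice between the two tuples, and it builds the index with `pd.MultiIndex.from_product` (my choice, so that the index is sorted, which tuple slicing needs). The slice is lexicographic, so it returns far more rows than the rectangular block that is wanted.

```python
import numpy as np
import pandas as pd

# Same shape as the MWE: y in 1..50, t in 1..100 (sorted MultiIndex, an assumption)
idx = pd.MultiIndex.from_product([range(1, 51), range(1, 101)], names=["y", "t"])
df = pd.DataFrame(np.random.sample(len(idx)), index=idx, columns=["Value"])

y1, yN, t1, tN = 4, 10, 5, 9

# Lexicographic slice: every row from (4, 5) up to and including (10, 9)
sliced = df.loc[(y1, t1):(yN, tN), :]

# Rows in the intended rectangular block y1..yN x t1..tN
wanted = (yN - y1 + 1) * (tN - t1 + 1)

# The slice keeps t >= t1 only in year y1, t <= tN only in year yN,
# and every t for the years in between.
t_in_middle_year = sliced.loc[5].index
print(len(sliced), wanted, t_in_middle_year.min(), t_in_middle_year.max())
```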

Answer 1

One idea is to use Index.isin with itertools.product in boolean indexing:

from itertools import product

idx_sub = tuple(product(Y, T))

dict_sub = df.loc[df.index.isin(idx_sub), 'Value'].to_dict()
print(dict_sub)
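For reference, a self-contained sketch of the same idea that can be run on its own and checked against the approach from the question. The setup mirrors the MWE, except that the index is built with `pd.MultiIndex.from_product` and the key ranges are passed as lists; both are my assumptions for the sake of a runnable example, not part of the original code.

```python
from itertools import product

import numpy as np
import pandas as pd

# Full DataFrame, same size as in the question
y_max, t_max = 50, 100
idx = pd.MultiIndex.from_product([range(1, y_max + 1), range(1, t_max + 1)])
df = pd.DataFrame(np.random.sample(y_max * t_max), index=idx, columns=["Value"])

# Subset to extract
Y = list(range(4, 11))
T = list(range(5, 10))
idx_sub = list(product(Y, T))

# Boolean mask over the whole index, then convert the Series to a dict
mask = df.index.isin(idx_sub)
dict_isin = df.loc[mask, "Value"].to_dict()

# Approach from the question, for comparison
data_sub = df.loc[(Y, T), :]
dict_loc = dict(zip(idx_sub, data_sub["Value"]))

# Both approaches should produce the same (y, t) -> value mapping
print(dict_isin == dict_loc)
```

Which variant is faster depends on the pandas version and on the size of the subset, so it is worth timing both on the real data (for example with `timeit`).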

Tags: python, select, dataframe, pandas