Manipulating a multi-index DataFrame with a repeating index
Suppose we have a structure like this:
from datetime import date, timedelta

import numpy as np
import pandas as pd
from pandas import DataFrame as df

# Three identical dates per list: today, yesterday, and two days ago.
idx1 = []
idx2 = []
idx3 = []
for i in range(3):
    idx1.append(date.today() - timedelta(days=0))
    idx2.append(date.today() - timedelta(days=1))
    idx3.append(date.today() - timedelta(days=2))
data1 = {"Sasquach": np.random.uniform(1, 10, 2), "Furby": np.random.uniform(1, 10, 2), "Ant": np.random.uniform(1, 10, 2)}
data2 = {"Sasquach": np.random.uniform(1, 10, 2), "Furby": np.random.uniform(1, 10, 2), "Ant": np.random.uniform(1, 10, 2)}
data3 = {"Sasquach": np.random.uniform(1, 10, 2), "Furby": np.random.uniform(1, 10, 2), "Ant": np.random.uniform(1, 10, 2)}
my_dataframe_1 = df.from_dict(data1, orient="index", columns=["pretty", "brave"])
my_dataframe_2 = df.from_dict(data2, orient="index", columns=["pretty", "brave"])
my_dataframe_3 = df.from_dict(data3, orient="index", columns=["pretty", "brave"])
my_dataframe_1["timestamp"] = idx1
my_dataframe_2["timestamp"] = idx2
my_dataframe_3["timestamp"] = idx3
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way.
ultimate_df = pd.concat([my_dataframe_1, my_dataframe_2, my_dataframe_3])
ultimate_df.sort_values(["timestamp", "pretty"], ascending=[True, False], inplace=True)
ultimate_df.reset_index(inplace=True)
ultimate_df.set_index(["timestamp", "timestamp"], inplace=True)
print(ultimate_df)
This gives us:
                          index    pretty     brave
timestamp  timestamp
2022-02-28 2022-02-28     Furby  6.083493  8.383633
           2022-02-28  Sasquach  3.454873  6.426673
           2022-02-28       Ant  1.279582  9.647796
2022-03-01 2022-03-01     Furby  9.667125  3.462951
           2022-03-01       Ant  3.443364  5.457242
           2022-03-01  Sasquach  3.364245  5.190403
2022-03-02 2022-03-02       Ant  2.773309  4.708483
           2022-03-02     Furby  2.765552  2.065672
           2022-03-02  Sasquach  2.347767  7.956183
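My question is whether there is an easy way to work with data structured like this, where the index repeats for every item in the "index" column. My goal is to easily select, from the "index" column, the item with the highest "pretty" value for the most recent date (I thought ultimate_df.iloc[-1].iloc[0] would do it, but it does not). A MultiIndex is not required here; it is just the approach I tried. But when I slice like this: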
print(ultimate_df.iloc[-1])
the result is a single row of the "inner" index:
index     Sasquach
pretty    2.347767
brave     7.956183
Name: (2022-03-02, 2022-03-02), dtype: object
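Do you have any suggestions? Could .iloc[-1] perhaps act only on the outer index (a kind of grouping that is not repeated for every item), so that it prints these 3 items for the latest timestamp? I don't really like df.groupby(), and I don't want to create a separate data frame for each item in the "index" column. My only idea is to use the MultiIndex and slice like this: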
print(ultimate_df.loc[ultimate_df.iloc[-1].name])  # this print is just for your convenience
print(ultimate_df.loc[ultimate_df.iloc[-1].name].iloc[0])
to get the item with the highest "pretty" value for the most recent timestamp/date, but it looks convoluted.
Result:
                          index    pretty     brave
timestamp  timestamp
2022-03-02 2022-03-02       Ant  2.773309  4.708483
           2022-03-02     Furby  2.765552  2.065672
           2022-03-02  Sasquach  2.347767  7.956183

index          Ant
pretty    2.773309
brave     4.708483
Name: (2022-03-02, 2022-03-02), dtype: object
Edit: for posterity, one alternative is to create a dictionary of data frames to help manipulate this kind of data with duplicates:
import numpy as np
import pandas as pd

data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] * 4, 'Ob1': np.random.rand(16), 'Ob2': np.random.rand(16)})
# create unique list of names
UniqueNames = data.Names.unique()
# create a data frame dictionary to store your data frames
DataFrameDict = {elem: pd.DataFrame() for elem in UniqueNames}
for key in DataFrameDict.keys():
    # keep only the rows whose name matches this key
    DataFrameDict[key] = data[data.Names == key]
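If groupby is acceptable for this one step, the same dictionary can probably be built in a single pass. This is only a sketch under that assumption; the name frames_by_name is made up for illustration, and the sample frame mirrors the one above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame, same shape as above: four names repeated four times.
data = pd.DataFrame({
    "Names": ["Joe", "John", "Jasper", "Jez"] * 4,
    "Ob1": np.random.rand(16),
    "Ob2": np.random.rand(16),
})

# Iterating a groupby yields (name, sub-frame) pairs, one per unique name.
frames_by_name = {name: group for name, group in data.groupby("Names")}

print(sorted(frames_by_name))
print(frames_by_name["Joe"].shape)
```

Each value is the sub-frame for one name, so it can be used the same way as the entries of DataFrameDict.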
Try resetting the MultiIndex and filtering to get the desired row:
>>> df = ultimate_df.droplevel(0).reset_index()
>>> df.loc[df[df["timestamp"].eq(df["timestamp"].max())]["pretty"].idxmax()]
timestamp    2022-03-02
index             Furby
pretty         8.162101
brave          1.038208
Name: 6, dtype: object
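The same filter-then-idxmax idea can be sketched without a MultiIndex at all, by keeping the timestamp as an ordinary column. The values below are fixed and made up for illustration (they are not the random data from the question), and the names flat, latest and best are hypothetical.

```python
import pandas as pd

# Hypothetical fixed data standing in for the random values in the question.
flat = pd.DataFrame({
    "timestamp": pd.to_datetime(["2022-03-01"] * 3 + ["2022-03-02"] * 3),
    "index": ["Furby", "Ant", "Sasquach"] * 2,
    "pretty": [9.6, 3.4, 3.3, 2.7, 2.8, 2.3],
    "brave": [3.4, 5.4, 5.1, 2.0, 4.7, 7.9],
})

# Keep only the rows that belong to the most recent date.
latest = flat[flat["timestamp"] == flat["timestamp"].max()]

# idxmax returns the row label of the highest "pretty" value within that date;
# .loc then fetches the full row from the original frame.
best = flat.loc[latest["pretty"].idxmax()]

print(best["index"])  # Ant
```

Because the row is looked up by label, this does not depend on the frame being sorted beforehand.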