How to convert every element of a nested list into a DataFrame row?
import pandas as pd

my_df = pd.DataFrame({'ID': ['12345', '23456', '34567'],
                      'Info': [[['Rob Kardashian', '00052369', '1987-03-17', 'Reality Star'], ['Brooke Barry', '00213658', '2001-03-30', 'TikTok Star']],
                               [['Bae De Leon', '00896351', '1997-08-02', 'Volleyball Player'], ['Jonas Blue', '02369785', '1990-08-02', 'Music Producer'], ['Albert Einstein', '65231478', '1879-03-14', 'Scientist']],
                               [['Robert Downey Jr', '23897410', '1965-04-04', 'Actor'], ['Stan Lee', '35239856', '1922-12-28', 'Publisher & Producer']]]})
Hi everyone, I have the DataFrame above and want to turn the elements in the 'Info' column into rows.
I tried
[[pd.DataFrame(i) for i in k] for k in my_df ['Info'].tolist()]
but the output is not what I expected.
Expected output:
Thanks in advance for your help!
Is this what you want:
import numpy as np

my_df = my_df.set_index('ID')
pd.DataFrame(np.concatenate(my_df.Info),
             index=my_df.index.repeat(my_df.Info.str.len()))
Out[1102]:
                      0         1           2                     3
ID
12345    Rob Kardashian  00052369  1987-03-17          Reality Star
12345      Brooke Barry  00213658  2001-03-30           TikTok Star
23456       Bae De Leon  00896351  1997-08-02     Volleyball Player
23456        Jonas Blue  02369785  1990-08-02        Music Producer
23456   Albert Einstein  65231478  1879-03-14             Scientist
34567  Robert Downey Jr  23897410  1965-04-04                 Actor
34567          Stan Lee  35239856  1922-12-28  Publisher & Producer
Note: I kept ID as the index of the output df. If you need it as a column, chain an additional .reset_index() as follows:
pd.DataFrame(np.concatenate(my_df.Info),
             index=my_df.index.repeat(my_df.Info.str.len())).reset_index()
Out[1106]:
      ID                 0         1           2                     3
0  12345    Rob Kardashian  00052369  1987-03-17          Reality Star
1  12345      Brooke Barry  00213658  2001-03-30           TikTok Star
2  23456       Bae De Leon  00896351  1997-08-02     Volleyball Player
3  23456        Jonas Blue  02369785  1990-08-02        Music Producer
4  23456   Albert Einstein  65231478  1879-03-14             Scientist
5  34567  Robert Downey Jr  23897410  1965-04-04                 Actor
6  34567          Stan Lee  35239856  1922-12-28  Publisher & Producer
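To make the two moving parts of this answer easier to follow, here is a minimal sketch that runs them separately. The small frame `df` and its values are hypothetical stand-ins for `my_df`, and `.tolist()` is used before `np.concatenate` only to make the input explicit. It assumes every inner list has the same length.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened stand-in for my_df.
df = pd.DataFrame({
    'ID': ['a', 'b'],
    'Info': [[['x1', 'x2'], ['y1', 'y2']],
             [['z1', 'z2']]],
}).set_index('ID')

# Number of inner lists in each row; .str.len() also works on list-valued cells.
counts = df.Info.str.len()

# Repeat each ID once per inner list so the index lines up with the new rows.
repeated_index = df.index.repeat(counts)

# Stack all inner lists into one 2-D array (needs equal-length inner lists).
stacked = np.concatenate(df.Info.tolist())

result = pd.DataFrame(stacked, index=repeated_index)
print(result)
```

If the inner lists can differ in length, `np.concatenate` will raise an error, so this approach probably only fits regular data like the example in the question.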
You can use groupby:
my_df.groupby("ID").Info.apply(lambda g: pd.DataFrame(g.iloc[0]))
This concatenates the returned DataFrames for you:
>>> my_df.groupby("ID").Info.apply(lambda g: pd.DataFrame(g.iloc[0]))
                        0         1           2                     3
ID
12345 0    Rob Kardashian  00052369  1987-03-17          Reality Star
      1      Brooke Barry  00213658  2001-03-30           TikTok Star
23456 0       Bae De Leon  00896351  1997-08-02     Volleyball Player
      1        Jonas Blue  02369785  1990-08-02        Music Producer
      2   Albert Einstein  65231478  1879-03-14             Scientist
34567 0  Robert Downey Jr  23897410  1965-04-04                 Actor
      1          Stan Lee  35239856  1922-12-28  Publisher & Producer
You can then optionally reset the index and drop the level_1 column:
expanded = my_df.groupby("ID").Info.apply(lambda g: pd.DataFrame(g.iloc[0]))
expanded.reset_index().drop("level_1", axis=1)
This gives you
      ID                 0         1           2                     3
0  12345    Rob Kardashian  00052369  1987-03-17          Reality Star
1  12345      Brooke Barry  00213658  2001-03-30           TikTok Star
2  23456       Bae De Leon  00896351  1997-08-02     Volleyball Player
3  23456        Jonas Blue  02369785  1990-08-02        Music Producer
4  23456   Albert Einstein  65231478  1879-03-14             Scientist
5  34567  Robert Downey Jr  23897410  1965-04-04                 Actor
6  34567          Stan Lee  35239856  1922-12-28  Publisher & Producer
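As a possible variation on the last step, the inner index level can be dropped directly instead of being turned into a level_1 column first. This is only a sketch on a hypothetical two-row frame `df`; it assumes each ID appears exactly once, so `g.iloc[0]` is that row's whole list of lists.

```python
import pandas as pd

# Hypothetical, shortened stand-in for my_df.
df = pd.DataFrame({
    'ID': ['a', 'b'],
    'Info': [[['x1', 'x2'], ['y1', 'y2']],
             [['z1', 'z2']]],
})

# Each group holds a single row, so g.iloc[0] is that row's list of lists.
expanded = df.groupby('ID').Info.apply(lambda g: pd.DataFrame(g.iloc[0]))

# Drop the unnamed inner level, then move ID from the index back into a column.
flat = expanded.reset_index(level=1, drop=True).reset_index()
print(flat)
```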
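Because this uses GroupBy.apply(), though, I would not expect it to be very fast.
Wrapping Andy's version and mine in functions for a timing run does show that my version is the slower option: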
In [99]: def np_concat(df):
    ...:     df = df.set_index('ID')
    ...:     return pd.DataFrame(np.concatenate(df.Info), index=df.index.repeat(df.Info.str.len()))
    ...:
In [100]: def groupby(df):
     ...:     df = df.groupby("ID").Info.apply(lambda g: pd.DataFrame(g.iloc[0]))
     ...:     return df.reset_index().drop("level_1", axis=1)
     ...:
In [101]: %timeit np_concat(my_df)
1.08 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [102]: %timeit groupby(my_df)
6.33 ms ± 394 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
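For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard library's `timeit` module. The frame is a hypothetical shortened stand-in, `groupby_apply` is just an illustrative name for the second function, and the repeat count of 100 is an arbitrary choice. No timings are shown because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical, shortened stand-in for my_df.
my_df = pd.DataFrame({
    'ID': ['a', 'b'],
    'Info': [[['x1', 'x2'], ['y1', 'y2']],
             [['z1', 'z2']]],
})

def np_concat(df):
    # Stack the inner lists and repeat each ID once per inner list.
    df = df.set_index('ID')
    out = pd.DataFrame(np.concatenate(df.Info.tolist()),
                       index=df.index.repeat(df.Info.str.len()))
    return out.reset_index()

def groupby_apply(df):
    # Build one small DataFrame per ID and let groupby concatenate them.
    out = df.groupby('ID').Info.apply(lambda g: pd.DataFrame(g.iloc[0]))
    return out.reset_index().drop('level_1', axis=1)

# Check that both versions produce the same rows before timing them.
same = np_concat(my_df).values.tolist() == groupby_apply(my_df).values.tolist()
print('same result:', same)

for fn in (np_concat, groupby_apply):
    seconds = timeit.timeit(lambda: fn(my_df), number=100)
    print(f'{fn.__name__}: {seconds / 100 * 1e3:.3f} ms per call')
```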