Pandas, convert aggregated dataframe to list of tuples
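I am trying to get a list of tuples from a pandas DataFrame. I'm more used to other APIs such as apache-spark, where a DataFrame has a method called collect, but after searching a bit I found this approach. The result isn't what I expected, though, and I think that is because the DataFrame holds aggregated data. Is there a simple way to do this?
Let me show my problem: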
print(df)
#date user Cost
#2016-10-01 xxxx 0.598111
# yyyy 0.598150
# zzzz 13.537223
#2016-10-02 xxxx 0.624247
# yyyy 0.624302
# zzzz 14.651441
print(df.values)
#[[ 0.59811124]
# [ 0.59814985]
# [ 13.53722286]
# [ 0.62424731]
# [ 0.62430216]
# [ 14.65144134]]
#I was expecting something like this:
[("2016-10-01", "xxxx", 0.598111),
("2016-10-01", "yyyy", 0.598150),
("2016-10-01", "zzzz", 13.537223),
("2016-10-02", "xxxx", 0.624247),
("2016-10-02", "yyyy", 0.624302),
("2016-10-02", "zzzz", 14.651441)]
Edit
I tried @Dervin's suggestion, but the result was not satisfactory.
collected = [tuple(x) for x in df.values]
collected
[(0.59811124000000004,), (0.59814985000000032,), (13.53722285999994,),
(0.62424731000000044,), (0.62430216000000027,), (14.651441339999931,),
(0.62414758000000026,), (0.62423407000000042,), (14.655454959999938,)]
What you have there is a hierarchical index, so you can first do what is described in this SO question, and then do something like [tuple(x) for x in df1.to_records(index=False)]. For example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
In [12]: df1
Out[12]:
a b c d
0 0.076626 -0.761338 0.150755 -0.428466
1 0.956445 0.769947 -1.433933 1.034086
2 -0.211886 -1.324807 -0.736709 -0.767971
...
In [13]: [tuple(x) for x in df1.to_records(index=False)]
Out[13]:
[(0.076625682946709128,
-0.76133754774190276,
0.15075466312259322,
-0.42846644471544015),
(0.95644517961731257,
0.76994677126920497,
-1.4339326896803839,
1.0340857719122247),
(-0.21188555188408928,
-1.3248066626301633,
-0.73670886051415208,
-0.76797061516159393),
...
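Applied to the frame in the question, the two steps might look like the sketch below. It assumes the linked question boils down to calling reset_index() to turn the index levels back into columns, and it rebuilds the aggregated frame with a hypothetical groupby(...).sum(), since the original aggregation is not shown.

```python
import pandas as pd

# Hypothetical reconstruction of the aggregated frame from the question:
# after the groupby, 'date' and 'user' are index levels and 'Cost' is the only column.
raw = pd.DataFrame(
    {
        "date": ["2016-10-01"] * 3 + ["2016-10-02"] * 3,
        "user": ["xxxx", "yyyy", "zzzz"] * 2,
        "Cost": [0.598111, 0.598150, 13.537223, 0.624247, 0.624302, 14.651441],
    }
)
df = raw.groupby(["date", "user"]).sum()

# Step 1: move the index levels back into ordinary columns.
flat = df.reset_index()

# Step 2a: the to_records approach from the answer above.
via_records = [tuple(x) for x in flat.to_records(index=False)]

# Step 2b: an alternative that yields plain tuples directly.
collected = list(flat.itertuples(index=False, name=None))
```

Here df.values only ever sees the Cost column, which is why the attempts in the question returned one-element tuples; once the index levels are columns again, each row carries the date, the user and the cost.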