Group by multiple columns and convert the result to a DataFrame/array

Hi, I have a DataFrame like this:

                       Value         day  hour  min
Time                                                         
2015-12-19 10:08:52     1805  2015-12-19    10    8
2015-12-19 10:09:52     1794  2015-12-19    10    9
2015-12-19 10:19:51     1796  2015-12-19    10   19
2015-12-19 10:20:51     1806  2015-12-19    10   20
2015-12-19 10:29:52     1802  2015-12-19    10   29
2015-12-19 10:30:52     1800  2015-12-19    10   30
2015-12-19 10:40:51     1804  2015-12-19    10   40
2015-12-19 10:41:51     1798  2015-12-19    10   41
2015-12-19 10:50:51     1790  2015-12-19    10   50
2015-12-19 10:51:52     1811  2015-12-19    10   51
2015-12-19 11:00:51     1803  2015-12-19    11    0
2015-12-19 11:01:52     1784  2015-12-19    11    1
                         ...    ...         ...   ...  ...
2016-07-15 17:30:13     1811  2016-07-15    17   30
2016-07-15 17:31:13     1787  2016-07-15    17   31
2016-07-15 17:41:13     1800  2016-07-15    17   41
2016-07-15 17:42:13     1795  2016-07-15    17   42

I want to group it by day and hour, and finally turn the "Value" column into a multidimensional array.

Grouped by day and hour, each hour should look like this:

2015-12-19  10 [1805, 1794, 1796, 1806, 1802, 1800, 1804, 179...  ]
2015-12-20  11 [1803, 1793, 1795, 1801, 1796, 1796, 1788, 180...  ]
...  
2016-07-15  17 [1794, 1792, 1788, 1799, 1811, 1803, 1808, 179... ]

Ultimately, I would like to end up with a DataFrame like this:


Time_index  hour    value1 value2 value3 ........value20

2015-12-19  10    1805, 1794, 1796, 1806 ... 1804, 1791, 1788, 1812  
2015-12-20  11    1803, 1793, 1795, 1801 ... 1796, 1796, 1788, 1800 
...  
2016-07-15  17    1794, 1792, 1788, 1799 ... 1811, 1803, 1808, 1790


Or an array like this:

[[1805, 1794, 1796, 1806, 1802, 1800, 1804, 179...  ],[1803, 1793, 1795, 1801, 1796, 1796, 1788, 180...  ]....[1794, 1792, 1788, 1799, 1811, 1803, 1808, 179... ]]


I was able to get groupby working with a single column:

grouped_0 = train_df.groupby(['day'])
grouped = grouped_0.aggregate(lambda x: list(x))
grouped['grouped'] = grouped['Value']

The 'grouped' column of the resulting DataFrame looks like this:

2015-12-19  [1805, 1794, 1796, 1806, 1802, 1800, 1804, 179...  
2015-12-20  [1790, 1809, 1809, 1789, 1807, 1804, 1790, 179...  
2015-12-21  [1794, 1792, 1788, 1799, 1811, 1803, 1808, 179...  
2015-12-22  [1815, 1812, 1798, 1808, 1802, 1788, 1808, 179...  
2015-12-23  [1803, 1800, 1799, 1803, 1802, 1804, 1788, 179...  
2015-12-24  [1803, 1795, 1801, 1798, 1799, 1802, 1799, 179...

However, when I try this:

grouped_0 = train_df.groupby(['day', 'hour'])
grouped = grouped_0.aggregate(lambda x: list(x))
grouped['grouped'] = grouped['Value']

It throws this error:

Traceback (most recent call last):
  File "<input>", line 3, in <module>
  File "C:\Apps\Continuum\Anaconda2\envs\python36\lib\site-packages\pandas\core\groupby.py", line 4036, in aggregate
    return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
  File "C:\Apps\Continuum\Anaconda2\envs\python36\lib\site-packages\pandas\core\groupby.py", line 3476, in aggregate
    return self._python_agg_general(arg, *args, **kwargs)
  File "C:\Apps\Continuum\Anaconda2\envs\python36\lib\site-packages\pandas\core\groupby.py", line 848, in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)
  File "C:\Apps\Continuum\Anaconda2\envs\python36\lib\site-packages\pandas\core\groupby.py", line 2180, in agg_series
    return self._aggregate_series_pure_python(obj, func)
  File "C:\Apps\Continuum\Anaconda2\envs\python36\lib\site-packages\pandas\core\groupby.py", line 2215, in _aggregate_series_pure_python
    raise ValueError('Function does not reduce')
ValueError: Function does not reduce

My pandas version: pd.__version__ is '0.20.3'

Yes, using agg for this is not the best idea, because the result is considered invalid unless it is a container holding a single object.

You can use groupby + apply for this instead (df here is the question's train_df):

g = df.groupby(['day', 'hour']).Value.apply(lambda x: x.values.tolist())
g

day         hour
2015-12-19  10      [1805, 1794, 1796, 1806, 1802, 1800, 1804, 179...
            11                                           [1803, 1784]
2016-07-15  17                               [1811, 1787, 1800, 1795]
Name: Value, dtype: object
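
For a self-contained check, the following minimal sketch rebuilds a few rows of the question's sample data and applies the same groupby + apply. The row selection is an assumption for illustration (only six readings copied from the question, not the full data set), and deriving day and hour from the index is one possible way to get those columns, not necessarily how the asker built them.

```python
import pandas as pd

# A handful of rows copied from the question's sample (not the full data set).
times = pd.to_datetime([
    "2015-12-19 10:08:52",
    "2015-12-19 10:09:52",
    "2015-12-19 11:00:51",
    "2015-12-19 11:01:52",
    "2016-07-15 17:30:13",
    "2016-07-15 17:31:13",
])
df = pd.DataFrame({"Value": [1805, 1794, 1803, 1784, 1811, 1787]},
                  index=pd.Index(times, name="Time"))

# Assumption: derive the grouping columns from the DatetimeIndex.
df["day"] = df.index.strftime("%Y-%m-%d")
df["hour"] = df.index.hour

# One Python list of Value readings per (day, hour) group.
g = df.groupby(["day", "hour"])["Value"].apply(lambda x: x.values.tolist())
print(g)
```

Each entry of g is an ordinary Python list, so a single group can be looked up with a (day, hour) tuple, for example g.loc[("2015-12-19", 10)].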

If you want each element in its own column, you can do this:

v = pd.DataFrame(g.values.tolist(), index=g.index)\
       .rename(columns=lambda x: 'value{}'.format(x + 1)).reset_index()

v is your final result.
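
The question also asks for a plain nested array. As a rough sketch, the small Series g below is a hypothetical stand-in for the grouped result (its values are illustrative, with deliberately unequal group sizes). Calling g.tolist() gives the list of lists directly, and the same DataFrame construction pads shorter groups with NaN when the groups are not all the same length.

```python
import pandas as pd

# Hypothetical stand-in for the grouped Series `g`, with unequal group sizes.
idx = pd.MultiIndex.from_tuples(
    [("2015-12-19", 10), ("2015-12-19", 11), ("2016-07-15", 17)],
    names=["day", "hour"],
)
g = pd.Series([[1805, 1794, 1796], [1803, 1784], [1811, 1787, 1800]],
              index=idx, name="Value")

# Nested list: one inner list per (day, hour) group.
nested = g.tolist()

# Wide frame: one column per position; shorter groups are padded with NaN.
v = (pd.DataFrame(g.values.tolist(), index=g.index)
       .rename(columns=lambda x: "value{}".format(x + 1))
       .reset_index())
print(nested)
print(v)
```

Because of the NaN padding, any column that contains a missing value is stored as float rather than int. If every hour is expected to have the same number of readings, checking the group sizes first (for example with g.map(len)) may be worthwhile.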