Convert Pandas DataFrame from Row-Based to Columnar
My dataframe (df) looks like this:
Date FieldA ValueA ValueB
09-02-2016 TypeA 3 5
09-02-2016 TypeB 6 7
I want the dataframe to look like this:
Date TypeA_ValueA TypeA_ValueB TypeB_ValueA TypeB_ValueB
09-02-2016 3 5 6 7
I tried df.pivot in pandas, where I can supply a single value column. It does not accept more than one. When I supply several, I get an exception:
Exception: Data must be 1-dimensional
df1 = df.set_index(['Date', 'FieldA']).unstack()
df1.columns = df1.columns.map('_'.join)
df1.reset_index()
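Note that joining the unstacked column labels as they come yields names like ValueA_TypeA, whereas the question asks for TypeA_ValueA. A minimal sketch of one way to get the requested order, assuming the two-row frame from the question (the name `wide` is just illustrative): swap the two column levels and sort them before joining.

```python
import pandas as pd

# Two-row frame from the question, built directly so the example is self-contained.
df = pd.DataFrame({
    "Date": ["09-02-2016", "09-02-2016"],
    "FieldA": ["TypeA", "TypeB"],
    "ValueA": [3, 6],
    "ValueB": [5, 7],
})

# unstack() moves FieldA into the columns, giving (value column, FieldA) pairs.
wide = df.set_index(["Date", "FieldA"]).unstack()

# Put FieldA first, then sort so the columns for each type sit next to each other.
wide.columns = wide.columns.swaplevel(0, 1)
wide = wide.sort_index(axis=1)

# Flatten the two levels into "TypeA_ValueA"-style names and restore Date as a column.
wide.columns = wide.columns.map("_".join)
wide = wide.reset_index()
print(wide)
```

Sorting the swapped levels is what groups the TypeA columns ahead of the TypeB columns; without it the names are correct but interleaved.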
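Setup for reference
from io import StringIO  # Python 3; the StringIO module only exists in Python 2
import pandas as pd
text = """Date FieldA ValueA ValueB
09-02-2016 TypeA 3 5
09-02-2016 TypeB 6 7"""
# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(StringIO(text), sep=r'\s+')
df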
In [36]: df
Out[36]:
Date FieldA ValueA ValueB
0 2016-09-02 TypeA 3 5
1 2016-09-02 TypeB 6 7
2 2016-09-03 TypeA 4 8
3 2016-09-03 TypeB 3 9
In [37]: v_cols = df.columns.difference(['FieldA', 'Date'])
In [38]: def func(x):
    ...:     # one new column per (FieldA value, value column) pair within this Date group
    ...:     d = {'_'.join([t, c]): x[x['FieldA'] == t][c].iloc[0] for t in x.FieldA for c in v_cols}
    ...:     for k, v in d.items():
    ...:         x[k] = v
    ...:     return x
    ...:
In [39]: newdf = df.groupby('Date', group_keys=False).apply(func)
In [40]: newdf.drop(v_cols.tolist() + ['FieldA'], axis=1).drop_duplicates()
Out[40]:
Date TypeA_ValueA TypeA_ValueB TypeB_ValueA TypeB_ValueB
0 2016-09-02 3 5 6 7
2 2016-09-03 4 8 3 9
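The same four-row frame can probably be reshaped without a custom function. The sketch below assumes a pandas version in which DataFrame.pivot accepts a list for `values` (older releases raised the "Data must be 1-dimensional" error from the question); the name `wide` is illustrative.

```python
import pandas as pd

# Four-row frame matching the transcript above.
df = pd.DataFrame({
    "Date": ["2016-09-02", "2016-09-02", "2016-09-03", "2016-09-03"],
    "FieldA": ["TypeA", "TypeB", "TypeA", "TypeB"],
    "ValueA": [3, 6, 4, 3],
    "ValueB": [5, 7, 8, 9],
})

# pivot() raises if a (Date, FieldA) pair occurs more than once, so duplicates are not hidden.
wide = df.pivot(index="Date", columns="FieldA", values=["ValueA", "ValueB"])

# Column labels are (value column, FieldA) pairs; flatten them as FieldA_value.
wide.columns = [f"{field}_{value}" for value, field in wide.columns]

# Order the columns by name and turn Date back into a regular column.
wide = wide[sorted(wide.columns)].reset_index()
print(wide)
```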
Use pd.pivot_table.
In [1]: pd.pivot_table(df, index='Date', columns='FieldA', values=['ValueA', 'ValueB'])
Out[1]:
ValueA ValueB
FieldA TypeA TypeB TypeA TypeB
Date
09-02-2016 3 6 5 7
So you get a DataFrame with a MultiIndex on the columns. If you want to flatten it and use _ as the separator in the column names, you can do this:
In [1]: df = pd.pivot_table(df, index='Date', columns='FieldA', values=['ValueA', 'ValueB'])
In [2]: df.columns = [ '{}_{}'.format(cat, val) for val, cat in df.columns ]
In [3]: df
Out[3]:
TypeA_ValueA TypeB_ValueA TypeA_ValueB TypeB_ValueB
Date
09-02-2016 3 6 5 7
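One caveat with pd.pivot_table is that it aggregates: if a (Date, FieldA) pair appears more than once, the default aggfunc takes the mean rather than raising. The sketch below illustrates this with a hypothetical extra TypeA row that is not in the original question; `flatten` is an illustrative helper, not a pandas function.

```python
import pandas as pd

# Question data plus one made-up duplicate (Date, TypeA) row to show the aggregation.
df = pd.DataFrame({
    "Date": ["09-02-2016", "09-02-2016", "09-02-2016"],
    "FieldA": ["TypeA", "TypeA", "TypeB"],
    "ValueA": [3, 5, 6],
    "ValueB": [5, 7, 7],
})

def flatten(frame):
    """Return a copy whose (value, field) column pairs are joined as field_value."""
    out = frame.copy()
    out.columns = [f"{field}_{value}" for value, field in out.columns]
    return out.reset_index()

# Default aggfunc is the mean, so the two TypeA rows are averaged.
averaged = flatten(pd.pivot_table(
    df, index="Date", columns="FieldA", values=["ValueA", "ValueB"]))

# aggfunc="first" keeps the first row seen for each (Date, FieldA) pair instead.
first = flatten(pd.pivot_table(
    df, index="Date", columns="FieldA", values=["ValueA", "ValueB"], aggfunc="first"))

print(averaged)
print(first)
```

If duplicates should be treated as an error rather than merged, df.pivot or set_index(...).unstack() is likely the safer choice, since both raise on repeated keys.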