How to compress a multi-dimensional dataframe into a single column?
I have the following dataframe:
0 1 2 3 4 5 6 7 8
0 Twitter (True 01/21/2015) None None None None None None None None
1 Google, Inc. (True 11/07/2016) None None None None None None None None
2 Microsoft, (True 07/01/2016) Facebook (True 11/01/2016) None None None None None None None
3 standard & poors, Inc. (True 11/08/2016) None None None None None None None None
8 apple (True 11/10/2016) apple (True 11/01/2016) None None None None None apple (True 11/01/2016) None
How can I compress the above dataframe into a single column, like this?
0
0 Twitter (True 01/21/2015)
1 Google, Inc. (True 11/07/2016)
2 Microsoft, (True 07/01/2016) \ Facebook (True 11/01/2016)
3 standard & poors, Inc. (True 11/08/2016) \
8 apple (True 11/10/2016) \ apple (True 11/01/2016) \ apple (True 11/01/2016)
I tried:
df = df.iloc[:,0].join('\')
However, I don't understand how to add the separator. How should I compress the dataframe with a separator between the values?
I think you need to replace None with NaN, then remove the NaN values with stack, and finally use groupby with apply and join:
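# None and the string 'None' become NaN; stack() then drops them
df = df.replace({None: np.nan, 'None': np.nan}).stack()
# the separator contains a literal backslash, so it is escaped as '\\'
df = df.groupby(level=0).apply(' \\ '.join)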
print (df)
0 Twitter (True 01/21/2015)
1 Google, Inc. (True 11/07/2016)
2 Microsoft, (True 07/01/2016) \ Facebook (True ...
3 standard & poors, Inc. (True 11/08/2016)
8 apple (True 11/10/2016) \ apple (True 11/01/20...
dtype: object
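For readers who want to run the stack-and-groupby idea end to end, here is a minimal self-contained sketch. The three-row sample frame and the names `cleaned`, `stacked`, and `result` are assumptions for illustration. It uses `mask` rather than `replace` to blank out the literal string 'None', and adds an explicit `dropna()` in case the installed pandas version keeps missing values when stacking.

```python
import pandas as pd

# Hypothetical sample modelled on the question's data (not the full frame).
df = pd.DataFrame(
    [
        ["Twitter (True 01/21/2015)", None, None],
        ["Microsoft, (True 07/01/2016)", "Facebook (True 11/01/2016)", None],
        ["apple (True 11/10/2016)", "None", "apple (True 11/01/2016)"],
    ],
    index=[0, 2, 8],
)

# Cells holding the literal string 'None' become missing values.
cleaned = df.mask(df == "None")

# Reshape to one long Series indexed by (row label, column label),
# then make sure no missing values are left.
stacked = cleaned.stack().dropna()

# Join the surviving strings of each original row with " \ ".
result = stacked.groupby(level=0).apply(" \\ ".join)
print(result)
```

One consequence of this approach: a row that contains only missing values has nothing left after stacking, so it does not appear in the result at all.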
Another solution with a list comprehension:
df = df.replace({None: np.nan, 'None': np.nan})
# Python 3: use str; Python 2: use basestring
# keep only the string cells of each row and join them with ' \ '
df = df.apply(lambda x: ' \\ '.join([y for y in x if isinstance(y, str)]), axis=1)
print (df)
0 Twitter (True 01/21/2015)
1 Google, Inc. (True 11/07/2016)
2 Microsoft, (True 07/01/2016) \ Facebook (True ...
3 standard & poors, Inc. (True 11/08/2016)
8 apple (True 11/10/2016) \ apple (True 11/01/20...
dtype: object
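The row-wise idea can also be written as a small named function, which may be easier to test and reuse. This is only a sketch: `join_row`, `SEP`, and the sample frame are hypothetical names and data, and the function filters the literal string 'None' itself instead of calling `replace` first.

```python
import pandas as pd

SEP = " \\ "  # space, one literal backslash, space


def join_row(row, sep=SEP):
    """Join the string cells of one row, skipping missing values and 'None'."""
    return sep.join(v for v in row if isinstance(v, str) and v != "None")


# Hypothetical sample modelled on the question's data.
df = pd.DataFrame(
    [
        ["Google, Inc. (True 11/07/2016)", None, None],
        ["Microsoft, (True 07/01/2016)", "Facebook (True 11/01/2016)", None],
        [None, "None", None],
    ],
    index=[1, 2, 3],
)

# axis=1 passes each row to join_row as a Series.
result = df.apply(join_row, axis=1)
print(result.tolist())
```

Unlike the stack-based approach, a row with no usable values is kept here and simply becomes an empty string.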
Timings:
#[50000 rows x 9 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
In [43]: %timeit (df.replace({None: np.nan, 'None': np.nan}).apply(lambda x : ''.join([y for y in x if isinstance(y, str)]), axis=1))
1 loop, best of 3: 820 ms per loop
In [44]: %timeit (df.replace({None: np.nan, 'None': np.nan}).stack().groupby(level=0).apply(' \ '.join))
1 loop, best of 3: 4.62 s per loop
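Outside IPython, a comparison like this could be reproduced with the standard `timeit` module. The sketch below is an assumption-laden stand-in, not the original benchmark: the two-row sample, the function names `rowwise` and `stack_groupby`, and the repeat count are all hypothetical, and both functions use the same separator so the comparison is like for like. Absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

# Hypothetical two-row sample, repeated to get a larger frame.
small = pd.DataFrame(
    [
        ["Twitter (True 01/21/2015)", None, None],
        ["Microsoft, (True 07/01/2016)", "Facebook (True 11/01/2016)", None],
    ]
)
big = pd.concat([small] * 500, ignore_index=True)  # 1000 rows


def rowwise(frame):
    # Row-by-row join of the string cells.
    return frame.apply(
        lambda row: " \\ ".join(v for v in row if isinstance(v, str)), axis=1
    )


def stack_groupby(frame):
    # Long format first, then one join per original row.
    return frame.stack().dropna().groupby(level=0).apply(" \\ ".join)


for func in (rowwise, stack_groupby):
    seconds = timeit.timeit(lambda: func(big), number=3)
    print(f"{func.__name__}: {seconds / 3:.3f} s per call")
```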
You can try this (I got the following output with a small dataframe, and it looks fine):
df = pd.DataFrame({'0':['Twitter (True 01/21/2015)', 'Google, Inc. (True 11/07/2016)', ' Microsoft, (True 07/01/2016)'], '1':[None, None, 'Facebook (True 11/01/2016)'], '2':[None, None, None]})
df = df.replace({None: ' ', 'None': ' '})
df.astype(str).apply(lambda x: '\\'.join(x), axis=1)
0 Twitter (True 01/21/2015)\ \
1 Google, Inc. (True 11/07/2016)\ \
2 Microsoft, (True 07/01/2016)\Facebook (True ...
dtype: object
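Note that the output above keeps one separator per empty cell, because every None was turned into a space before joining. If those trailing separators are not wanted, one possible variation (a sketch, not the answerer's code) is to drop the missing cells of each row before joining. It assumes the empty cells are real missing values; cells holding the literal string 'None' would need to be cleaned first.

```python
import pandas as pd

# Same small frame as in the answer above.
df = pd.DataFrame({
    '0': ['Twitter (True 01/21/2015)',
          'Google, Inc. (True 11/07/2016)',
          ' Microsoft, (True 07/01/2016)'],
    '1': [None, None, 'Facebook (True 11/01/2016)'],
    '2': [None, None, None],
})

# For each row, discard missing cells and join what is left with a backslash.
result = df.apply(lambda row: '\\'.join(row.dropna().astype(str)), axis=1)
print(result.tolist())
```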