Evaluate every cell and return the column header if not null (pandas DataFrame)
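I have a pandas DataFrame with 233 rows * 234 columns, and I need to evaluate every cell and return the corresponding column header if the cell is not NaN. So far I have written the following: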
# First get a list of all column names (except column 0):
col_list = []
for column in df.columns[1:]:
    col_list.append(column)

# Then I try to iterate through every cell and evaluate for null.
# A counter is also initiated to take the next col_name from col_list
# when count reaches 233.
for index, row in df.iterrows():
    count = 0
    for x in row[1:]:
        count = count + 1
        for col_name in col_list:
            if count >= 233:
                break
            elif str(x) != 'nan':
                print(col_name)
The code doesn't quite do that. What do I need to change so that the code breaks after 233 rows and moves on to the next col_name?
Example:
   Col_1  Col_2  Col_3
1    nan     13    nan
2     10    nan    nan
3    nan      2      5
4    nan    nan      4

output:
1  Col_2
2  Col_1
3  Col_2
4  Col_3
5  Col_3
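You can use dropna:
df.dropna(axis=1).columns
axis: {0 or 'index', 1 or 'columns'}
how: {'any', 'all'}
Basically, dropna removes nulls: axis=1 drops columns, how="any" (the default) drops a column if it contains at least one null, and .columns returns the headers that remain.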
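To see what this returns, here is a small sketch on the example frame from the question. The frame construction and the names cols_any and cols_all are assumptions made for illustration. Note that dropna filters whole columns, so it yields one header per surviving column rather than one header per non-null cell.

```python
import numpy as np
import pandas as pd

# Example frame from the question (assumed: row labels 1..4 are the index).
df = pd.DataFrame(
    {
        "Col_1": [np.nan, 10, np.nan, np.nan],
        "Col_2": [13, np.nan, 2, np.nan],
        "Col_3": [np.nan, np.nan, 5, 4],
    },
    index=[1, 2, 3, 4],
)

# how="any" (the default): drop a column if it has at least one NaN.
cols_any = df.dropna(axis=1, how="any").columns.tolist()

# how="all": drop a column only if every value in it is NaN.
cols_all = df.dropna(axis=1, how="all").columns.tolist()

print(cols_any)
print(cols_all)
```

On this frame every column contains at least one NaN, so how="any" leaves no columns, while how="all" keeps all three.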
I think you need stack if the first column is the index. It removes all NaNs; then get the values from the second level of the MultiIndex, either by reset_index and selecting the column, or by the Series constructor with Index.get_level_values:
s = df.stack().reset_index()['level_1'].rename('a')
print (s)
0 Col_2
1 Col_1
2 Col_2
3 Col_3
4 Col_3
Name: a, dtype: object
Or:
s = pd.Series(df.stack().index.get_level_values(1))
print (s)
0 Col_2
1 Col_1
2 Col_2
3 Col_3
4 Col_3
dtype: object
If you need the output as a list:
L = df.stack().index.get_level_values(1).tolist()
print (L)
['Col_2', 'Col_1', 'Col_2', 'Col_3', 'Col_3']
Details:
print (df.stack())
1  Col_2    13.0
2  Col_1    10.0
3  Col_2     2.0
   Col_3     5.0
4  Col_3     4.0
dtype: float64
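For anyone who wants to run the stack approach end to end, here is a self-contained sketch on the example data. The frame construction and the names stacked and headers are assumptions for illustration. Depending on the pandas version, stack() may keep NaN entries instead of dropping them, so the sketch calls dropna() explicitly; where NaNs were already dropped, that call changes nothing.

```python
import numpy as np
import pandas as pd

# Example frame from the question (assumed: row labels 1..4 are the index).
df = pd.DataFrame(
    {
        "Col_1": [np.nan, 10, np.nan, np.nan],
        "Col_2": [13, np.nan, 2, np.nan],
        "Col_3": [np.nan, np.nan, 5, 4],
    },
    index=[1, 2, 3, 4],
)

# stack() moves the column labels into a second index level, giving one
# entry per cell. The explicit dropna() guards against pandas versions
# whose stack() does not drop NaN entries on its own.
stacked = df.stack().dropna()

# Level 0 holds the row labels, level 1 the column headers.
headers = stacked.index.get_level_values(1).tolist()
print(headers)
```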
I would use jezrael's stack solution.
However, if you're interested, here is a NumPy way, which is usually faster.
In [4889]: np.tile(df.columns, df.shape[0])[~np.isnan(df.values.ravel())]
Out[4889]: array(['Col_2', 'Col_1', 'Col_2', 'Col_3', 'Col_3'], dtype=object)
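One caveat: np.isnan only accepts numeric input, so the one-liner assumes every column is float. A variant that builds the mask with DataFrame.notna also works when some columns hold strings or other objects. This is a sketch under that assumption; the frame construction and the names mask, tiled and headers are illustrative.

```python
import numpy as np
import pandas as pd

# Example frame from the question (assumed: row labels 1..4 are the index).
df = pd.DataFrame(
    {
        "Col_1": [np.nan, 10, np.nan, np.nan],
        "Col_2": [13, np.nan, 2, np.nan],
        "Col_3": [np.nan, np.nan, 5, 4],
    },
    index=[1, 2, 3, 4],
)

# Boolean mask of non-null cells, flattened in row-major order.
mask = df.notna().to_numpy().ravel()

# Repeat the full header row once per DataFrame row so that position i
# in the tiled array is the header of flattened cell i.
tiled = np.tile(df.columns.to_numpy(), len(df))

# Keep only the headers whose cell is non-null.
headers = tiled[mask]
print(headers.tolist())
```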
Timings:
In [4913]: df.shape
Out[4913]: (100, 3)
In [4914]: %timeit np.tile(df.columns, df.shape[0])[~np.isnan(df.values.ravel())]
10000 loops, best of 3: 35.8 µs per loop
In [4915]: %timeit df.stack().index.get_level_values(1)
1000 loops, best of 3: 335 µs per loop
In [4905]: df.shape
Out[4905]: (100000, 3)
In [4907]: %timeit np.tile(df.columns, df.shape[0])[~np.isnan(df.values.ravel())]
100 loops, best of 3: 5.98 ms per loop
In [4908]: %timeit df.stack().index.get_level_values(1)
100 loops, best of 3: 11.7 ms per loop
Choose depending on your needs (readability, speed, maintainability, etc.).
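For comparison, a plain-loop version in the spirit of the question's code could look like the sketch below. It is not taken from either answer. It assumes the row labels live in the index, so no column has to be skipped, and the names pairs, col_name and value are illustrative. Iterating over row.items() delivers each value together with its header, which removes the need for a counter.

```python
import numpy as np
import pandas as pd

# Example frame from the question (assumed: row labels 1..4 are the index).
df = pd.DataFrame(
    {
        "Col_1": [np.nan, 10, np.nan, np.nan],
        "Col_2": [13, np.nan, 2, np.nan],
        "Col_3": [np.nan, np.nan, 5, 4],
    },
    index=[1, 2, 3, 4],
)

pairs = []
for index, row in df.iterrows():
    # row is a Series keyed by column name, so each value comes
    # paired with the header of the column it belongs to.
    for col_name, value in row.items():
        if pd.notna(value):
            pairs.append((index, col_name))

for index, col_name in pairs:
    print(index, col_name)
```

This is the slowest of the options shown, but it may be the easiest to adapt if extra per-cell logic is needed.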