How do I 'squash' the values in a DataFrame that I know only has one item per row into a Series?

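I have a DataFrame that I've confirmed has at most one value in each row (the rest are np.nan). How can I turn it into a one-dimensional array or Series?

Suppose this is my starting array: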

In [6]: import numpy as np

In [7]: import pandas as pd

In [8]: data = [
    [np.nan, 9.0, np.nan],
    [np.nan, np.nan, 3.0],
    [np.nan, np.nan, 5.0],
    [np.nan, np.nan, np.nan],
    [1.0, np.nan, np.nan]
]

In [9]: a = pd.DataFrame(data)

In [10]: a
Out[10]: 
     0    1    2
0  NaN  9.0  NaN
1  NaN  NaN  3.0
2  NaN  NaN  5.0
3  NaN  NaN  NaN
4  1.0  NaN  NaN

I want to create the following Series b:

In [17]: b
Out[17]: 
0    9.0
1    3.0
2    5.0
3    NaN
4    1.0
dtype: float64

I've already written some code to do this:

In [14]: m = a.notnull()

In [15]: m
Out[15]: 
       0      1      2
0  False   True  False
1  False  False   True
2  False  False   True
3  False  False  False
4   True  False  False

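In [16]: b = pd.Series(np.nan, index=a.index)  # start with all NaN
for i, row in a.iterrows():
    for j, v in row.items():  # iteritems() was removed in pandas 2.0
        if m.iloc[i, j]:
            b[i] = v

But there must be a simpler way!

I tried using np.max and np.sum, but they both return an empty (nan) array.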

You can use first_valid_index, but it needs a condition for the case where all values in a row are NaN:

def f(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]

b = a.apply(f, axis=1)

print (b)
0    9.0
1    3.0
2    5.0
3    NaN
4    1.0
dtype: float64

Another solution with sum and numpy.where:

print (pd.Series(np.where(a.notnull().any(axis=1), a.sum(axis=1), np.nan)))
0    9.0
1    3.0
2    5.0
3    NaN
4    1.0
dtype: float64
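
A shorter variant of the same idea, offered as a sketch rather than as part of the original answer: sum accepts a min_count argument, so a row with no valid values comes back as NaN without the np.where step. The frame below just rebuilds the example data from the question.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
a = pd.DataFrame([
    [np.nan, 9.0, np.nan],
    [np.nan, np.nan, 3.0],
    [np.nan, np.nan, 5.0],
    [np.nan, np.nan, np.nan],
    [1.0, np.nan, np.nan],
])

# min_count=1 requires at least one non-NaN value per row;
# rows that have none yield NaN instead of 0.0.
b = a.sum(axis=1, min_count=1)
print(b)
```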

The solution with np.max also works well:

print (np.max(a, axis=1))
0    9.0
1    3.0
2    5.0
3    NaN
4    1.0
dtype: float64

Or the simpler and fastest option, max:

print (a.max(axis=1))
0    9.0
1    3.0
2    5.0
3    NaN
4    1.0
dtype: float64
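
Note that max only squashes the frame correctly if the one-value-per-row assumption really holds; with two values in a row it silently keeps the larger one. A minimal sketch of a guard, assuming you would rather fail loudly (the name valid_per_row is just illustrative):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
a = pd.DataFrame([
    [np.nan, 9.0, np.nan],
    [np.nan, np.nan, 3.0],
    [np.nan, np.nan, 5.0],
    [np.nan, np.nan, np.nan],
    [1.0, np.nan, np.nan],
])

# count(axis=1) returns the number of non-NaN values in each row.
valid_per_row = a.count(axis=1)
if (valid_per_row > 1).any():
    raise ValueError("some rows hold more than one value")

# Safe to squash: each row has zero or one value.
b = a.max(axis=1)
print(b)
```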

Timings:

a = pd.concat([a]*10000).reset_index(drop=True)

In [133]: %timeit (a.max(axis=1))
100 loops, best of 3: 2.81 ms per loop

In [134]: %timeit (np.max(a, axis=1))
100 loops, best of 3: 2.83 ms per loop

In [135]: %timeit (pd.Series(np.where(a.notnull().any(axis=1), a.sum(axis=1), np.nan)))
100 loops, best of 3: 3.18 ms per loop

In [136]: %timeit (a.apply(f, axis=1))
1 loop, best of 3: 2.18 s per loop

In [137]: %timeit a.max(axis=1, skipna=True)
100 loops, best of 3: 2.84 ms per loop

def user(dataDF):
    # Start from an explicit float Series of NaN so the dtype is well defined.
    squash = pd.Series(np.nan, index=dataDF.index)
    for col in dataDF.columns.values:
        squash.update(dataDF[col])
    return squash

print(user(a))
In [151]: %timeit (user(a))
100 loops, best of 3: 7.75 ms per loop
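
The %timeit magic only exists inside IPython. As a rough sketch for re-running the comparison from a plain script, the standard timeit module can be used instead; the names small, big, candidates and results are illustrative, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Rebuild the example frame and enlarge it as in the timings above.
small = pd.DataFrame([
    [np.nan, 9.0, np.nan],
    [np.nan, np.nan, 3.0],
    [np.nan, np.nan, 5.0],
    [np.nan, np.nan, np.nan],
    [1.0, np.nan, np.nan],
])
big = pd.concat([small] * 10000).reset_index(drop=True)

# The vectorized candidates from this thread.
candidates = {
    "DataFrame.max": lambda: big.max(axis=1),
    "np.max": lambda: np.max(big, axis=1),
    "sum + np.where": lambda: pd.Series(
        np.where(big.notnull().any(axis=1), big.sum(axis=1), np.nan)
    ),
}

# Check that every candidate gives the same answer before timing it.
reference = big.max(axis=1)
results = {name: func() for name, func in candidates.items()}
for name, result in results.items():
    assert result.equals(reference), name

for name, func in candidates.items():
    seconds = timeit.timeit(func, number=10) / 10
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```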

Edit based on a comment:

If the values are not numeric, you can use:

import pandas as pd
import numpy as np

data = [
    [np.nan, 'a', np.nan],
    [np.nan, np.nan, 'b'],
    [np.nan, np.nan, 'c'],
    [np.nan, np.nan, np.nan],
    ['d', np.nan, np.nan]
]

a = pd.DataFrame(data)
print (a)
     0    1    2
0  NaN    a  NaN
1  NaN  NaN    b
2  NaN  NaN    c
3  NaN  NaN  NaN
4    d  NaN  NaN

print (a.fillna('').sum(axis=1).mask(a.isnull().all(axis=1)))
0      a
1      b
2      c
3    NaN
4      d
dtype: object
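
A different technique that does not depend on the dtype, shown here only as a sketch and not taken from the answer above: back-fill along each row so the single value lands in the first column, then select that column.

```python
import numpy as np
import pandas as pd

# Rebuild the non-numeric example frame.
a = pd.DataFrame([
    [np.nan, 'a', np.nan],
    [np.nan, np.nan, 'b'],
    [np.nan, np.nan, 'c'],
    [np.nan, np.nan, np.nan],
    ['d', np.nan, np.nan],
])

# bfill(axis=1) copies each value leftwards within its row, so the
# first column ends up holding the row's only value (or NaN if none).
b = a.bfill(axis=1).iloc[:, 0]
print(b)
```

Because nothing is summed or compared, the same line should also work for the numeric frame in the question.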

You can use pd.DataFrame.max or pd.DataFrame.sum with skipna set to True:

skipna: boolean, default True; Exclude NA/null values. If an entire row/column is NA, the result will be NA

So, you should try

a.max(axis=1, skipna=True)

I would use the update() function. For a variable number of columns:

dataDF = pd.DataFrame(data)

squash = pd.Series(np.nan, index=dataDF.index)  # explicit float NaN start
for col in dataDF.columns.values:
    squash.update(dataDF[col])

print (squash)

0    9.0
1    3.0
2    5.0
3    NaN
4    1.0
dtype: float64
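
The same column-by-column idea can be sketched without an explicit loop by folding the columns together with Series.combine_first, which fills the NaN positions of one Series from another. This is an alternative phrasing, not the method above; the name columns is illustrative.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
dataDF = pd.DataFrame([
    [np.nan, 9.0, np.nan],
    [np.nan, np.nan, 3.0],
    [np.nan, np.nan, 5.0],
    [np.nan, np.nan, np.nan],
    [1.0, np.nan, np.nan],
])

# One Series per column, for any number of columns.
columns = [dataDF[col] for col in dataDF.columns]

# Fold left to right: keep what is already filled, and take the
# next column's value wherever the running result is still NaN.
squash = reduce(lambda left, right: left.combine_first(right), columns)
print(squash)
```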