Find column labels in a filtered numpy array from a pandas data frame

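I have a pandas data frame that I need to export to numpy to perform some operations. These operations remove some of the columns. I want to compare the resulting numpy array with my original data frame to get the labels of the columns that were retained. The problem is that some columns may not be unique...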

An example:

  1. Create an example data frame:
>>> import numpy as np
>>> import pandas as pd
>>> # np.nan is used because the np.NaN alias was removed in NumPy 2.0
>>> loci_df = pd.DataFrame(columns=['SNP1','SNP2','SNP3','SNP4','SNP5','SNP6','SNP7','SNP8','SNP9','SNP10'],
...                        data=[[ 0., np.nan,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.],
...                                [ 0., np.nan,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  0.],
...                                [ np.nan, np.nan,  0.,  2.,  0.,  1.,  1.,  1.,  0., np.nan],
...                                [ 0., np.nan,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.],
...                                [ 0., np.nan,  2.,  1.,  0.,  1.,  1.,  1.,  0.,  0.],
...                                [ 0., np.nan,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
...                                [ 0., np.nan,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.],
...                                [ 0., np.nan,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.],
...                                [ 0., np.nan,  0.,  1.,  0.,  0.,  1.,  1.,  0.,  0.],
...                                [ 0., np.nan,  0.,  1.,  1., np.nan,  1.,  1.,  0.,  1.]])
>>> loci_df
   SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9  SNP10
0   0.0   NaN   0.0   0.0   0.0   0.0   1.0   1.0   0.0    0.0
1   0.0   NaN   1.0   1.0   1.0   1.0   1.0   1.0   0.0    0.0
2   NaN   NaN   0.0   2.0   0.0   1.0   1.0   1.0   0.0    NaN
3   0.0   NaN   0.0   1.0   1.0   1.0   1.0   1.0   0.0    0.0
4   0.0   NaN   2.0   1.0   0.0   1.0   1.0   1.0   0.0    0.0
5   0.0   NaN   0.0   0.0   0.0   1.0   0.0   0.0   0.0    0.0
6   0.0   NaN   1.0   0.0   0.0   0.0   1.0   1.0   0.0    0.0
7   0.0   NaN   0.0   0.0   0.0   1.0   1.0   1.0   0.0    0.0
8   0.0   NaN   0.0   1.0   0.0   0.0   1.0   1.0   0.0    0.0
9   0.0   NaN   0.0   1.0   1.0   NaN   1.0   1.0   0.0    1.0

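  2. Move it to a numpy array and perform some operations. Here, a column is removed if all of its values are missing or if all of its non-missing values are equal.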
>>> loci = np.array(loci_df)
>>> m1 = np.isnan(loci)
>>> m2 = loci[0]==loci
>>> loci = loci[:,~(m1|m2).all(0)]
>>> loci
array([[ 0.,  0.,  0.,  0.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  0.],
       [ 0.,  2.,  0.,  1.,  1.,  1., nan],
       [ 0.,  1.,  1.,  1.,  1.,  1.,  0.],
       [ 2.,  1.,  0.,  1.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  1.,  1.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.,  1.,  0.],
       [ 0.,  1.,  1., nan,  1.,  1.,  1.]])

What I now want is the list of labels from the original data frame for the columns that remain after filtering in numpy:

['SNP3', 'SNP4', 'SNP5', 'SNP6', 'SNP7', 'SNP8', 'SNP10']

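Note: some columns may not be unique. For example, columns SNP7 and SNP8 here hold the same values, and I want to keep both! This means that my (admittedly suboptimal) approach of using the column values as dictionary keys and the column labels as dictionary values will not work...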
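
To illustrate why a dictionary keyed on column values loses columns, here is a minimal sketch with two made-up columns that hold identical values. The names `columns` and `by_value` are illustrative and not part of the original code:

```python
# Two columns with identical values, stored as hashable tuples
columns = {"SNP7": (1.0, 1.0, 0.0), "SNP8": (1.0, 1.0, 0.0)}

# Invert the mapping: values become keys, labels become values.
# Identical value tuples collide, so the later label overwrites the earlier one.
by_value = {values: label for label, values in columns.items()}

print(by_value)  # {(1.0, 1.0, 0.0): 'SNP8'}
```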

I tried reading the filtered data into a new data frame and then comparing the original data with the result, but unsurprisingly I get a KeyError:

>>> filtered=pd.DataFrame(data=loci)
>>> filtered
     0    1    2    3    4    5    6
0  0.0  0.0  0.0  0.0  1.0  1.0  0.0
1  1.0  1.0  1.0  1.0  1.0  1.0  0.0
2  0.0  2.0  0.0  1.0  1.0  1.0  NaN
3  0.0  1.0  1.0  1.0  1.0  1.0  0.0
4  2.0  1.0  0.0  1.0  1.0  1.0  0.0
5  0.0  0.0  0.0  1.0  0.0  0.0  0.0
6  1.0  0.0  0.0  0.0  1.0  1.0  0.0
7  0.0  0.0  0.0  1.0  1.0  1.0  0.0
8  0.0  1.0  0.0  0.0  1.0  1.0  0.0
9  0.0  1.0  1.0  NaN  1.0  1.0  1.0

>>> loci_df.loc[:,np.all(loci_df.values==filtered.values, axis=0)]

Traceback (most recent call last):
  File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1761, in __getitem__
    return self._getitem_tuple(key)
  File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1271, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1388, in _getitem_lowerdim
    section = self._getitem_axis(key, axis=i)
  File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1964, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 624, in _get_label
    return self.obj._xs(label, axis=axis)
  File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 3529, in xs
    return self[key]
  File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2800, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False

Is there a way to get this, or do I need to change my approach completely?
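
A likely reason for the `KeyError: False` above is that the two arrays being compared have shapes (10, 10) and (10, 7), which cannot be broadcast against each other, so no per-column boolean mask is ever produced. On the NumPy version in the traceback the failed comparison appears to have collapsed to the scalar `False`, which `.loc` then looked up as a column label; newer NumPy versions may raise an error at the comparison instead. The sketch below only checks the shape compatibility and assumes NumPy 1.20 or later for `np.broadcast_shapes`:

```python
import numpy as np

original_shape = (10, 10)  # shape of loci_df.values in the example
filtered_shape = (10, 7)   # shape of filtered.values in the example

# Elementwise comparison needs broadcast-compatible shapes;
# the trailing dimensions 10 and 7 do not match.
try:
    np.broadcast_shapes(original_shape, filtered_shape)
    compatible = True
except ValueError:
    compatible = False

print(compatible)  # False
```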

The idea is to build a mask by combining the two conditions, then use it to select the column names passed to the DataFrame constructor:

loci = np.array(loci_df)
m1 = np.isnan(loci)
m2 = loci[0]==loci

mask = ~(m1|m2).all(0)
loci = loci[:,mask]

print(loci_df.columns[mask])
Index(['SNP3', 'SNP4', 'SNP5', 'SNP6', 'SNP7', 'SNP8', 'SNP10'], dtype='object')
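
If a plain Python list is wanted, as in the question, the resulting Index can be converted with `tolist()`. A minimal sketch on a smaller, made-up frame (the names `df`, `arr` and `labels` and the data are illustrative, not from the original post):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "SNP1": [0.0, np.nan, 0.0],        # constant apart from a missing value
    "SNP2": [np.nan, np.nan, np.nan],  # entirely missing
    "SNP3": [0.0, 1.0, 2.0],
    "SNP7": [1.0, 1.0, 0.0],
    "SNP8": [1.0, 1.0, 0.0],           # duplicate of SNP7, should be kept
})

arr = df.to_numpy()
m1 = np.isnan(arr)        # missing entries
m2 = arr[0] == arr        # entries equal to the first row
mask = ~(m1 | m2).all(0)  # True for columns to keep

# The boolean mask indexes the column Index directly
labels = df.columns[mask].tolist()
print(labels)  # ['SNP3', 'SNP7', 'SNP8']
```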

filtered = pd.DataFrame(data=loci, columns=loci_df.columns[mask])
print(filtered)
   SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP10
0   0.0   0.0   0.0   0.0   1.0   1.0    0.0
1   1.0   1.0   1.0   1.0   1.0   1.0    0.0
2   0.0   2.0   0.0   1.0   1.0   1.0    NaN
3   0.0   1.0   1.0   1.0   1.0   1.0    0.0
4   2.0   1.0   0.0   1.0   1.0   1.0    0.0
5   0.0   0.0   0.0   1.0   0.0   0.0    0.0
6   1.0   0.0   0.0   0.0   1.0   1.0    0.0
7   0.0   0.0   0.0   1.0   1.0   1.0    0.0
8   0.0   1.0   0.0   0.0   1.0   1.0    0.0
9   0.0   1.0   1.0   NaN   1.0   1.0    1.0
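
As an alternative sketch that stays in pandas (not part of the approach above): `DataFrame.nunique()` counts the distinct non-missing values per column, so an entirely missing column gives 0 and a column whose non-missing values are all equal gives 1. Keeping the columns with more than one distinct value yields the same labels on the example data. One difference to be aware of: the numpy mask compares against the first row, so a column whose first value is NaN but which is otherwise constant would be kept by the numpy version and dropped here. The names `keep` and `filtered_df` are illustrative:

```python
import numpy as np
import pandas as pd

nan = np.nan
loci_df = pd.DataFrame(
    columns=['SNP1', 'SNP2', 'SNP3', 'SNP4', 'SNP5',
             'SNP6', 'SNP7', 'SNP8', 'SNP9', 'SNP10'],
    data=[[0., nan, 0., 0., 0., 0., 1., 1., 0., 0.],
          [0., nan, 1., 1., 1., 1., 1., 1., 0., 0.],
          [nan, nan, 0., 2., 0., 1., 1., 1., 0., nan],
          [0., nan, 0., 1., 1., 1., 1., 1., 0., 0.],
          [0., nan, 2., 1., 0., 1., 1., 1., 0., 0.],
          [0., nan, 0., 0., 0., 1., 0., 0., 0., 0.],
          [0., nan, 1., 0., 0., 0., 1., 1., 0., 0.],
          [0., nan, 0., 0., 0., 1., 1., 1., 0., 0.],
          [0., nan, 0., 1., 0., 0., 1., 1., 0., 0.],
          [0., nan, 0., 1., 1., nan, 1., 1., 0., 1.]],
)

# nunique() ignores NaN by default: 0 for all-missing, 1 for constant columns
keep = loci_df.nunique() > 1

# A boolean Series indexed by column label selects columns via .loc
filtered_df = loci_df.loc[:, keep]

print(filtered_df.columns.tolist())
# ['SNP3', 'SNP4', 'SNP5', 'SNP6', 'SNP7', 'SNP8', 'SNP10']
```

Because the labels travel with the data frame, there is no need to map the numpy result back to the original columns afterwards.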