Find the column labels of a filtered numpy array derived from a pandas data frame
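I have a pandas data frame that I need to export to numpy to perform some operations. These operations result in some columns being removed. I would like to compare the resulting numpy array with my original data frame to get the labels of the columns that were retained. The problem is that some columns may not be unique...
An example:
- Create an example data frame: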
>>> import numpy as np
>>> import pandas as pd
>>> loci_df = pd.DataFrame(columns=['SNP1','SNP2','SNP3','SNP4','SNP5','SNP6','SNP7','SNP8','SNP9','SNP10'],
... data=[[ 0., np.nan, 0., 0., 0., 0., 1., 1., 0., 0.],
... [ 0., np.nan, 1., 1., 1., 1., 1., 1., 0., 0.],
... [ np.nan, np.nan, 0., 2., 0., 1., 1., 1., 0., np.nan],
... [ 0., np.nan, 0., 1., 1., 1., 1., 1., 0., 0.],
... [ 0., np.nan, 2., 1., 0., 1., 1., 1., 0., 0.],
... [ 0., np.nan, 0., 0., 0., 1., 0., 0., 0., 0.],
... [ 0., np.nan, 1., 0., 0., 0., 1., 1., 0., 0.],
... [ 0., np.nan, 0., 0., 0., 1., 1., 1., 0., 0.],
... [ 0., np.nan, 0., 1., 0., 0., 1., 1., 0., 0.],
... [ 0., np.nan, 0., 1., 1., np.nan, 1., 1., 0., 1.]])
>>> loci_df
SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10
0 0.0 NaN 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
1 0.0 NaN 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0
2 NaN NaN 0.0 2.0 0.0 1.0 1.0 1.0 0.0 NaN
3 0.0 NaN 0.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0
4 0.0 NaN 2.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0
5 0.0 NaN 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
6 0.0 NaN 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
7 0.0 NaN 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0
8 0.0 NaN 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0
9 0.0 NaN 0.0 1.0 1.0 NaN 1.0 1.0 0.0 1.0
- Move it to a numpy array and perform some operations; here, remove the columns where all values are missing, or where all non-missing values are equal.
>>> loci = np.array(loci_df)
>>> m1 = np.isnan(loci)
>>> m2 = loci[0]==loci
>>> loci = loci[:,~(m1|m2).all(0)]
>>> loci
array([[ 0., 0., 0., 0., 1., 1., 0.],
[ 1., 1., 1., 1., 1., 1., 0.],
[ 0., 2., 0., 1., 1., 1., nan],
[ 0., 1., 1., 1., 1., 1., 0.],
[ 2., 1., 0., 1., 1., 1., 0.],
[ 0., 0., 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 1., 1., 0.],
[ 0., 0., 0., 1., 1., 1., 0.],
[ 0., 1., 0., 0., 1., 1., 0.],
[ 0., 1., 1., nan, 1., 1., 1.]])
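What I would now like to get is the list of labels from the original data frame for the columns that were retained after filtering in numpy.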
['SNP3', 'SNP4', 'SNP5', 'SNP6', 'SNP7', 'SNP8', 'SNP10']
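Note: some columns may not be unique, e.g. columns SNP7 and SNP8 here have identical values, and I want to keep both of them! But this means that my (suboptimal) approach of using the column values as dictionary keys and the column labels as dictionary values won't work...
I tried reading the filtered data into a new data frame and then comparing the original data with the result, but unsurprisingly I get KeyErrors: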
>>> filtered=pd.DataFrame(data=loci)
>>> filtered
0 1 2 3 4 5 6
0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
1 1.0 1.0 1.0 1.0 1.0 1.0 0.0
2 0.0 2.0 0.0 1.0 1.0 1.0 NaN
3 0.0 1.0 1.0 1.0 1.0 1.0 0.0
4 2.0 1.0 0.0 1.0 1.0 1.0 0.0
5 0.0 0.0 0.0 1.0 0.0 0.0 0.0
6 1.0 0.0 0.0 0.0 1.0 1.0 0.0
7 0.0 0.0 0.0 1.0 1.0 1.0 0.0
8 0.0 1.0 0.0 0.0 1.0 1.0 0.0
9 0.0 1.0 1.0 NaN 1.0 1.0 1.0
>>> loci_df.loc[:,np.all(loci_df.values==filtered.values, axis=0)]
Traceback (most recent call last):
File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1761, in __getitem__
return self._getitem_tuple(key)
File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1271, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1388, in _getitem_lowerdim
section = self._getitem_axis(key, axis=i)
File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1964, in _getitem_axis
return self._get_label(key, axis=axis)
File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 624, in _get_label
return self.obj._xs(label, axis=axis)
File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 3529, in xs
return self[key]
File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2800, in __getitem__
indexer = self.columns.get_loc(key)
File "/Users/jilska2/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False
Is there a way to get this, or do I need to change my approach entirely?
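For context on the KeyError: the two arrays being compared have shapes (10, 10) and (10, 7), which cannot be broadcast against each other. On the NumPy version behind the traceback, the `==` apparently collapsed to a single scalar `False`, which `.loc` then tried to look up as a column label. Newer NumPy releases, as far as I know, raise an error at the comparison itself. A minimal sketch of the shape check, using stand-in arrays of zeros rather than the real data, and assuming NumPy 1.20 or later for `np.broadcast_shapes`:

```python
import numpy as np

# Stand-ins with the same shapes as loci_df.values and the filtered array.
original = np.zeros((10, 10))
filtered = np.zeros((10, 7))

try:
    np.broadcast_shapes(original.shape, filtered.shape)
    compatible = True
except ValueError:
    # The trailing dimensions (10 and 7) differ and neither is 1,
    # so an elementwise comparison between the two is not defined.
    compatible = False

print(compatible)  # False
```

So an elementwise comparison of the full and the filtered array cannot recover the labels; the column mask itself has to be kept.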
The idea is to create a mask by chaining the two conditions, then use it to filter the column names passed to the DataFrame constructor:
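loci = np.array(loci_df)
# True where a value is missing
m1 = np.isnan(loci)
# True where a value equals the first row's value in the same column
m2 = loci[0]==loci
# keep a column unless every entry is missing or equal to the first row
mask = ~(m1|m2).all(0)
loci = loci[:,mask]
# the same boolean mask selects the matching labels
print(loci_df.columns[mask])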
Index(['SNP3', 'SNP4', 'SNP5', 'SNP6', 'SNP7', 'SNP8', 'SNP10'], dtype='object')
filtered = pd.DataFrame(data=loci, columns=loci_df.columns[mask])
print(filtered)
SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP10
0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
1 1.0 1.0 1.0 1.0 1.0 1.0 0.0
2 0.0 2.0 0.0 1.0 1.0 1.0 NaN
3 0.0 1.0 1.0 1.0 1.0 1.0 0.0
4 2.0 1.0 0.0 1.0 1.0 1.0 0.0
5 0.0 0.0 0.0 1.0 0.0 0.0 0.0
6 1.0 0.0 0.0 0.0 1.0 1.0 0.0
7 0.0 0.0 0.0 1.0 1.0 1.0 0.0
8 0.0 1.0 0.0 0.0 1.0 1.0 0.0
9 0.0 1.0 1.0 NaN 1.0 1.0 1.0
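The same idea can be packaged so that the array and its labels are always returned together. This is only a sketch under some assumptions: the helper name `drop_uninformative` and the small `demo` frame are made up for illustration, and the frame is assumed to be entirely numeric.

```python
import numpy as np
import pandas as pd


def drop_uninformative(df):
    """Return (filtered array, kept labels) for a numeric DataFrame."""
    arr = df.to_numpy(dtype=float)
    missing = np.isnan(arr)
    same_as_first = arr == arr[0]  # row 0 is broadcast against every row
    keep = ~(missing | same_as_first).all(axis=0)
    return arr[:, keep], df.columns[keep].tolist()


# Hypothetical toy data, not the SNP frame from the question.
demo = pd.DataFrame({
    "A": [0.0, 0.0, np.nan],  # constant apart from NaN: dropped
    "B": [np.nan] * 3,        # all missing: dropped
    "C": [0.0, 1.0, 2.0],     # varies: kept
    "D": [1.0, 1.0, 0.0],     # varies: kept
    "E": [1.0, 1.0, 0.0],     # duplicate of D, still kept
})

values, labels = drop_uninformative(demo)
print(labels)  # ['C', 'D', 'E']

# Pandas-only cross-check: nunique() ignores NaN by default, so all-NaN
# columns count 0 distinct values and constant columns count 1.
labels_alt = demo.columns[(demo.nunique() > 1).to_numpy()].tolist()
print(labels_alt)  # ['C', 'D', 'E']
```

One caveat with comparing against the first row: if a column starts with NaN but is otherwise constant, `arr == arr[0]` is False everywhere in that column, so the column is kept even though all of its non-missing values are equal. The `nunique()` variant does not depend on the first row and may be the safer choice if that case can occur in the data.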