Filtering a Pandas DataFrame using a condition on column values that are numpy arrays

Question

I have a Pandas DataFrame called 'dt' with two columns, 'A' and 'B'. The values in column 'B' are numpy arrays, like this:

index   A   B
0       a   [1,2,3]
1       b   [2,3,4]
2       c   [3,4,5]

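where:

type(dt["B"][0])

returns: numpy.ndarray

I want to filter this DataFrame to get another DataFrame that contains only the rows where the numpy array stored in 'B' contains a specific element.

I tried this:

dt[element in dt["B"]]

For example:

dt[2 in dt["B"]]

should return: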

index   A   B
0       a   [1,2,3]
1       b   [2,3,4]

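But this raises an error, namely "KeyError: True".

If the values in column 'B' were strings, I could do the same thing without any error:

dt[dt["B"] == value]

So I would like to know why my code does not work, and what "KeyError: True" means.

The full error is: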

KeyError                                  Traceback (most recent call last)
~/Applications/Conda/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2645             try:
-> 2646                 return self._engine.get_loc(key)
   2647             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: True

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-151-aa9ea046a48f> in <module>
----> 1 quotes_of_base["BTC" in quotes_of_base["quote"]]

~/Applications/Conda/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2798             if self.columns.nlevels > 1:
   2799                 return self._getitem_multilevel(key)
-> 2800             indexer = self.columns.get_loc(key)
   2801             if is_integer(indexer):
   2802                 indexer = [indexer]

~/Applications/Conda/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: True

Answer 1

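Keep in mind that filtering a DataFrame this way needs a list of True/False values, so if push comes to shove you can always build that list elsewhere (with a list comprehension or a for loop) and pass it to the DataFrame as dt[constructed_true_false_list]. Just make sure the list has one entry for every row of the DataFrame.

It is hard to propose a solution without a concrete example, but you could try something like this: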

[True if np.any(my_np_array == element) else False for my_np_array in dt["B"].values]
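
To make the idea above concrete, here is a minimal sketch that rebuilds the question's DataFrame (the sample data and the variable names `scalar_result`, `mask` and `filtered` are assumptions for illustration). It also shows why the original attempt fails: `in` applied to a Series tests the index labels and returns a single bool, so `dt[...]` ends up looking for a column named `True`.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question: column "B" holds numpy arrays.
dt = pd.DataFrame({
    "A": ["a", "b", "c"],
    "B": [np.array([1, 2, 3]), np.array([2, 3, 4]), np.array([3, 4, 5])],
})

# `in` on a Series checks the index labels (0, 1, 2), not the stored arrays,
# so the result is one bool; dt[that_bool] is then treated as a column lookup.
scalar_result = 2 in dt["B"]

element = 2  # hypothetical value to search for

# One boolean per row: True where that row's array contains `element`.
mask = [bool(np.any(arr == element)) for arr in dt["B"]]

filtered = dt[mask]
print(filtered)
```

The list comprehension produces exactly one entry per row, which is what boolean filtering expects.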

Answer 2

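I combined the commenters' answers. Note that when I read the data in, the lists came through as strings, so the check below is a substring test, and you may need the str(2) part only if your column also holds strings.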

df[df.apply(lambda x: True if str(2) in x['B'] else False, axis=1)]

   A        B
0  a  [1,2,3]
1  b  [2,3,4]
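
If the column really does hold strings, a substring test can match more than intended (for example "2" is also found inside "12"). The following is only a sketch under the assumption that each string is a valid Python list literal such as "[1,2,3]". The sample data and the names `substring_mask`, `parsed` and `value_mask` are made up for illustration, and parsing with `ast.literal_eval` is an alternative I am suggesting, not part of the answer above.

```python
import ast
import pandas as pd

# Assumption: column "B" was read from a file and holds strings, not arrays.
df = pd.DataFrame({
    "A": ["a", "b", "c", "d"],
    "B": ["[1,2,3]", "[2,3,4]", "[3,4,5]", "[12,13]"],
})

# Substring test, as in the answer above: "2" also matches inside "12".
substring_mask = df["B"].apply(lambda s: str(2) in s)

# Parse each string into a real list, then test membership on the values.
parsed = df["B"].apply(ast.literal_eval)
value_mask = parsed.apply(lambda values: 2 in values)

print(df[substring_mask])
print(df[value_mask])
```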

Answer 3

Suppose you have something like:

      A         B
  0  10   [11, 0]
  1  20  [11, 10]
  2  30  [11, 10]
  3  40   [10, 0]
  4  50   [11, 0]
  5  60   [10, 0]

and you want to keep only the rows whose array contains 10:

      A         B
  1  20  [11, 10]
  2  30  [11, 10]
  3  40   [10, 0]
  5  60   [10, 0]

You can use .apply:

  import pandas as pd

  # create the dataframe; each entry of column 'B' is a list
  df = pd.DataFrame({
      'A': [10, 20, 30, 40, 50, 60],
      'B': [[11, 0], [11, 10], [11, 10], [10, 0], [11, 0], [10, 0]],
  })

  # results is a boolean Series: True where the row's list in 'B' contains 10
  results = df.B.apply(lambda a: 10 in a)

  # filter the dataframe based on the boolean Series
  df_filtered = df[results]
  print(df_filtered)

Then you get:

      A         B
  1  20  [11, 10]
  2  30  [11, 10]
  3  40   [10, 0]
  5  60   [10, 0]

You can find more details at: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
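
The same .apply approach should also work when column 'B' holds numpy arrays, as in the question, because `in` on a 1-D array tests its elements. The sketch below assumes every array has the same length so they can be stacked into one 2-D array and compared in a single step. The names `mask_apply`, `stacked` and `mask_vectorized` are illustrative only.

```python
import numpy as np
import pandas as pd

pairs = [[11, 0], [11, 10], [11, 10], [10, 0], [11, 0], [10, 0]]
df = pd.DataFrame({
    "A": [10, 20, 30, 40, 50, 60],
    "B": [np.array(p) for p in pairs],  # numpy arrays instead of lists
})

# Row-by-row membership test, same idea as the .apply call above.
mask_apply = df["B"].apply(lambda a: 10 in a)

# Assumption: all arrays share one length, so they stack into shape (6, 2).
stacked = np.stack(df["B"].tolist())
mask_vectorized = (stacked == 10).any(axis=1)

print(df[mask_vectorized])
```

Both masks select the same rows. The stacked version avoids a Python-level call per row, but it only applies when the arrays are all the same length.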

