Best way to subset a pandas dataframe

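Hey, I'm new to pandas and I just came across df.query().

Why would people use df.query() when you can filter a DataFrame directly with bracket notation? The official pandas tutorial also seems to prefer the latter approach.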

With bracket notation:

df[df['age'] <= 21]

With the pandas query method:

df.query('age <= 21')

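Besides the stylistic or flexibility differences that have already been mentioned, is one of them canonically preferred, specifically in terms of performance on large DataFrames?

Consider the following sample DF: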

In [307]: df
Out[307]:
  sex  age     name
0   M   40      Max
1   F   35     Anna
2   M   29      Joe
3   F   18    Maria
4   F   23  Natalie

There are a number of good reasons to prefer the .query() method.

  • It can be much shorter and cleaner than boolean indexing:

    In [308]: df.query("20 <= age <= 30 and sex=='F'")
    Out[308]:
      sex  age     name
    4   F   23  Natalie
    
    In [309]: df[(df['age']>=20) & (df['age']<=30) & (df['sex']=='F')]
    Out[309]:
      sex  age     name
    4   F   23  Natalie
    
  • You can prepare the conditions (queries) programmatically:

    In [315]: conditions = {'name':'Joe', 'sex':'M'}
    
    In [316]: q = ' and '.join(['{}=="{}"'.format(k,v) for k,v in conditions.items()])
    
    In [317]: q
    Out[317]: 'name=="Joe" and sex=="M"'
    
    In [318]: df.query(q)
    Out[318]:
      sex  age name
    2   M   29  Joe
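
One caveat with the join shown above: it wraps every value in double quotes, so a numeric condition such as age 29 would be compared as the string "29". A minimal sketch of one way around that, assuming the values are plain Python strings or numbers, is to format each value with repr(). The helper name build_query is hypothetical, not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["M", "F", "M", "F", "F"],
    "age": [40, 35, 29, 18, 23],
    "name": ["Max", "Anna", "Joe", "Maria", "Natalie"],
})

def build_query(conditions):
    """Hypothetical helper: turn {column: value} into a query string.

    repr() quotes strings ('Joe') and leaves numbers bare (29), so
    each value is compared with its own type.
    """
    return " and ".join(f"{col} == {val!r}" for col, val in conditions.items())

q = build_query({"name": "Joe", "age": 29})
result = df.query(q)
```

This still assumes the column names are valid Python identifiers and that the values come from a trusted source, since the string is evaluated as an expression.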
    

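PS: there are also some disadvantages:

  • We cannot use the .query() method directly on columns whose names contain spaces or consist only of digits (newer pandas versions let you wrap a name containing spaces in backticks, e.g. `first name`)
  • Not all functions can be applied, or in some cases we have to use engine='python' instead of the default engine='numexpr' (which is faster)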
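
To make the two limitations concrete, here is a small sketch. It assumes a pandas version recent enough to support backtick quoting (0.25 or later, as far as I know); the frame and its column names are made up for illustration:

```python
import pandas as pd

# Made-up example frame; "first name" contains a space on purpose.
people = pd.DataFrame({
    "first name": ["Max", "Anna", "Joe"],
    "surname": ["Smith", "Jones", "Stone"],
    "age": [40, 35, 29],
})

# A column name with a space has to be wrapped in backticks.
joe = people.query("`first name` == 'Joe'")

# String methods are not something numexpr can evaluate, so this
# query explicitly asks for the (slower) Python engine.
s_names = people.query("surname.str.startswith('S')", engine="python")
```

The bracket form, people[people['first name'] == 'Joe'], needs no special quoting, which is one reason people fall back to it for awkward column names.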

NOTE: Jeff (one of the main pandas contributors and a member of the pandas core team) once said:

Note that in reality .query is just a nice-to-have interface, in fact it has very specific guarantees, meaning its meant to parse like a query language, and not a fully general interface.

Some other interesting usages from the documentation:

Reusable

A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you’re interested in querying -- (Source)

Example:

import pandas as pd

dfA = pd.DataFrame([[1,2,3], [4,5,6]], columns=["X", "Y", "Z"])
dfB = pd.DataFrame([[1,3,3], [4,1,6]], columns=["X", "Y", "Z"])
q = "(X > 3) & (Y < 10)"

print(dfA.query(q))
print(dfB.query(q))

   X  Y  Z
1  4  5  6
   X  Y  Z
1  4  1  6

More flexible syntax

df.query('a < b and b < c')  # understand a bit more English

Support for the in and not in operators (an alternative to isin)

df.query('a in [3, 4, 5]') # select rows whose value of column a is in [3, 4, 5]

Special usage of == and != (similar to in/not in)

df.query('a == [1, 3, 5]') # select rows whose value of column a is in [1, 3, 5]
# equivalent to df.query('a in [1, 3, 5]')
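
To check that these forms really select the same rows as isin, here is a quick sketch on a made-up frame (the column names a and b are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": list("vwxyz")})

wanted = [1, 3, 5]

# Three spellings of the same membership filter.
via_in = df.query("a in [1, 3, 5]")
via_eq = df.query("a == [1, 3, 5]")
via_isin = df[df["a"].isin(wanted)]

# The negated form, equivalent to df[~df["a"].isin(wanted)].
via_not_in = df.query("a not in [1, 3, 5]")
```

All three positive forms should return the rows with index 0, 2 and 4, and the not in form should return the remaining two.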