如何使用另一个数据框以特定方式过滤我的数据框?

How to filter my dataframe in specific way with another dataframe?

我有一个数据框 df1:

id1   id2
a1    b1  
c1    d1 
e1    d1
g1    h1   

和 df2:

id    value
a1     10
b1     9
c1     7
d1     11
e1     12
g1     5
h1     8

我只想保留 df1 中的行,前提是它们与 df2 中值列的值相差(差距)不高于 1。因此,所需的输出是:

id1   id2
a1    b1  
e1    d1   

c1 d1 行被删除,因为 7 和 11 之间的差距大于 1。与 g1 h1 相同。怎么做?

IIUC:

df1[df1.applymap(df2.set_index('id').value.get).eval('abs(id1 - id2)').le(1)]

  id1 id2
0  a1  b1
2  e1  d1

更长的答案

# Callable I'll need in `applymap`
# it basically translates `df2` into
# a function that returns `'value'`
# when you pass `'id'`
c = df2.set_index('id').value.get

# `applymap` applies a callable to each dataframe cell
df1_applied = df1.applymap(c)
print(df1_applied)

   id1  id2
0   10    9
1    7   11
2   12   11
3    5    8

# `eval` takes a string argument that describes what
# calculation to do.  See docs for more
df1_applied_evaled = df1_applied.eval('abs(id1 - id2)')
print(df1_applied_evaled)

0    1
1    4
2    1
3    3
dtype: int64

# now just boolean slice your way to the end
df1[df1_applied_evaled.le(1)]

  id1 id2
0  a1  b1
2  e1  d1

这是使用布尔索引的一种方法。思路是stackdf1中的Ids'从df2中得到对应的值,然后过滤差值小于1的行:

out = df1.loc[df1.stack().map(df2.set_index('id')['value']).droplevel(-1).groupby(level=0).diff().abs().dropna().le(1).pipe(lambda x: x[x].index)]

输出:

  id1 id2
0  a1  b1
2  e1  d1

使用 datar、re-imagining 个 pandas 个 API 可以轻松直观地完成此操作:

>>> from datar.all import f, tibble, left_join, mutate, abs, filter, select
>>> 
>>> df1 = tibble(
...     id1=["a1", "c1", "e1", "g1"],
...     id2=["b1", "d1", "d1", "h1"],
... )
>>> 
>>> df2 = tibble(
...     id=["a1", "b1", "c1", "d1", "e1", "g1", "h1"],
...     value=[10, 9, 7, 11, 12, 5, 8],
... )
>>> 
>>> (
...     df1 
...     >> left_join(df2, by={"id1": f.id})  # get the values of id1
...     >> left_join(df2, by={"id2": f.id})  # get the values of id2
...     >> mutate(diff=abs(f.value_x - f.value_y))  # calculate the diff
...     >> filter(f.diff <= 1)  # filter with diff <= 1
...     >> select(f.id1, f.id2)  # keep only desired columns
... )
       id1      id2
  <object> <object>
0       a1       b1
2       e1       d1