How to filter my dataframe in a specific way with another dataframe?
I have a dataframe df1:
id1 id2
a1 b1
c1 d1
e1 d1
g1 h1
and df2:
id value
a1 10
b1 9
c1 7
d1 11
e1 12
g1 5
h1 8
I want to keep only the rows of df1 whose two ids have values in the value column of df2 that differ by no more than 1. So the desired output is:
id1 id2
a1 b1
e1 d1
The row c1 d1 is dropped because the gap between 7 and 11 is greater than 1. The same goes for g1 h1. How can I do this?
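For anyone who wants to reproduce this, the two frames can be built roughly like this. It is only a sketch: it assumes plain pandas with a default integer index, which the tables in the question do not show explicitly.

```python
import pandas as pd

# Pairs of ids to be filtered (assumed default RangeIndex 0..3)
df1 = pd.DataFrame({
    "id1": ["a1", "c1", "e1", "g1"],
    "id2": ["b1", "d1", "d1", "h1"],
})

# Lookup table: one value per id
df2 = pd.DataFrame({
    "id": ["a1", "b1", "c1", "d1", "e1", "g1", "h1"],
    "value": [10, 9, 7, 11, 12, 5, 8],
})
```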
IIUC:
df1[df1.applymap(df2.set_index('id').value.get).eval('abs(id1 - id2)').le(1)]
id1 id2
0 a1 b1
2 e1 d1
Longer answer
# Callable I'll need in `applymap`
# it basically translates `df2` into
# a function that returns `'value'`
# when you pass `'id'`
c = df2.set_index('id').value.get
# `applymap` applies a callable to each dataframe cell
df1_applied = df1.applymap(c)
print(df1_applied)
id1 id2
0 10 9
1 7 11
2 12 11
3 5 8
# `eval` takes a string argument that describes what
# calculation to do. See docs for more
df1_applied_evaled = df1_applied.eval('abs(id1 - id2)')
print(df1_applied_evaled)
0 1
1 4
2 1
3 3
dtype: int64
# now just boolean slice your way to the end
df1[df1_applied_evaled.le(1)]
id1 id2
0 a1 b1
2 e1 d1
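A side note on the same idea: as far as I know, `applymap` is deprecated from pandas 2.1 onward in favor of `DataFrame.map`. A minimal sketch that avoids both by mapping each column through a lookup Series is below. The names `lookup` and `gap` are my own, and the frames are rebuilt so the snippet runs by itself.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id1": ["a1", "c1", "e1", "g1"],
    "id2": ["b1", "d1", "d1", "h1"],
})
df2 = pd.DataFrame({
    "id": ["a1", "b1", "c1", "d1", "e1", "g1", "h1"],
    "value": [10, 9, 7, 11, 12, 5, 8],
})

# Series indexed by id, so that Series.map can translate id -> value
lookup = df2.set_index("id")["value"]

# Absolute gap between the two looked-up values, row by row
gap = (df1["id1"].map(lookup) - df1["id2"].map(lookup)).abs()

# Keep the rows whose gap is at most 1 (rows 0 and 2 for this data)
result = df1[gap.le(1)]
print(result)
```

Ids that are missing from `df2` would map to NaN, and a NaN gap compares as False, so such rows would be dropped.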
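Here is one way using boolean indexing. The idea is to stack the ids in df1, map them to their corresponding values from df2, and then keep the rows whose difference is at most 1: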
out = df1.loc[df1.stack().map(df2.set_index('id')['value']).droplevel(-1).groupby(level=0).diff().abs().dropna().le(1).pipe(lambda x: x[x].index)]
Output:
id1 id2
0 a1 b1
2 e1 d1
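The chain is fairly dense, so here it is split into intermediate steps. This is only a sketch of the same logic: the variable names are made up for illustration, and the frames are rebuilt so it runs standalone.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id1": ["a1", "c1", "e1", "g1"],
    "id2": ["b1", "d1", "d1", "h1"],
})
df2 = pd.DataFrame({
    "id": ["a1", "b1", "c1", "d1", "e1", "g1", "h1"],
    "value": [10, 9, 7, 11, 12, 5, 8],
})

# 1. Stack to one long Series of ids indexed by (row, column name)
stacked = df1.stack()

# 2. Replace each id with its value, then drop the column-name level
#    so both values of a row share the same index label
values = stacked.map(df2.set_index("id")["value"]).droplevel(-1)

# 3. Within each original row, take the difference between the two values;
#    the first entry of every group is NaN and is dropped
gaps = values.groupby(level=0).diff().abs().dropna()

# 4. Index labels of the rows whose gap is at most 1
keep_idx = gaps[gaps.le(1)].index

out = df1.loc[keep_idx]
print(out)
```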
Using datar, which re-imagines the pandas API, this can be done easily and intuitively:
>>> from datar.all import f, tibble, left_join, mutate, abs, filter, select
>>>
>>> df1 = tibble(
... id1=["a1", "c1", "e1", "g1"],
... id2=["b1", "d1", "d1", "h1"],
... )
>>>
>>> df2 = tibble(
... id=["a1", "b1", "c1", "d1", "e1", "g1", "h1"],
... value=[10, 9, 7, 11, 12, 5, 8],
... )
>>>
>>> (
... df1
... >> left_join(df2, by={"id1": f.id}) # get the values of id1
... >> left_join(df2, by={"id2": f.id}) # get the values of id2
... >> mutate(diff=abs(f.value_x - f.value_y)) # calculate the diff
... >> filter(f.diff <= 1) # filter with diff <= 1
... >> select(f.id1, f.id2) # keep only desired columns
... )
id1 id2
<object> <object>
0 a1 b1
2 e1 d1
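The same join-then-filter pipeline can also be sketched in plain pandas with two left merges. This is not the datar API. The renamed columns `value1` and `value2` are my own choice, made to avoid suffix clashes between the two joins.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id1": ["a1", "c1", "e1", "g1"],
    "id2": ["b1", "d1", "d1", "h1"],
})
df2 = pd.DataFrame({
    "id": ["a1", "b1", "c1", "d1", "e1", "g1", "h1"],
    "value": [10, 9, 7, 11, 12, 5, 8],
})

# Two renamed copies of df2, one per join key
v1 = df2.rename(columns={"id": "id1", "value": "value1"})
v2 = df2.rename(columns={"id": "id2", "value": "value2"})

# Left joins keep every row of df1, in its original order
merged = df1.merge(v1, on="id1", how="left").merge(v2, on="id2", how="left")

# Filter on the absolute difference and keep only the id columns
mask = (merged["value1"] - merged["value2"]).abs() <= 1
result = merged.loc[mask, ["id1", "id2"]]
print(result)
```

Note that `merge` returns a fresh 0..n-1 index, so the row labels match `df1` here only because `df1` itself uses the default index.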