Comparing 2 tuples in a dataframe
Given the following dataframe:
import json
import numpy as np
import pandas as pd
test_list = ['purple', 'red', 'yellow']
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': [['red','blue'], ['white'], ['blue','yellow']]})
df['colors_new'] = df.colors.map(tuple)
I am trying to generate a new column that flags a row as True when the row contains at least one element from test_list:
df['found'] = any((True for x in test_list if x in df['colors_new']))
df
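In the example above, rows 0 and 2 should be True, because red is in row 0 and yellow is in row 2.
What would be the most efficient and correct way to do this? The result I currently get is wrong.
The closest I have come to a correct result is: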
df['found'] = ['red' in x for x in df['colors_new']]
But this does not work when I have several items to check (test_list = ['purple', 'red', 'yellow']).
You can use a lambda function to get what you want:
import json
import numpy as np
import pandas as pd
test_list = ['purple', 'red', 'yellow']
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': [['red','blue'], ['white'], ['blue','yellow']]})
df['colors_new'] = df.colors.map(tuple)
df['found'] = df['colors_new'].apply(lambda x: bool(max([1 if y in test_list else 0 for y in x])))
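For context on why the attempt in the question marks every row False: `x in df['colors_new']` tests membership against the Series index labels (0, 1, 2), not against the tuples stored in the column, and the outer `any(...)` collapses to a single boolean that is then broadcast to every row. A minimal sketch of a per-row check with `any`, using the same data as the question (the name `found_rowwise` is only for illustration):

```python
import pandas as pd

test_list = ['purple', 'red', 'yellow']
df = pd.DataFrame({'numbers': [1, 2, 3],
                   'colors': [['red', 'blue'], ['white'], ['blue', 'yellow']]})
df['colors_new'] = df.colors.map(tuple)

# `in` on a Series looks at the index labels, not at the stored values
print('red' in df['colors_new'])   # False
print(0 in df['colors_new'])       # True

# Evaluate any() once per row instead of once for the whole column
found_rowwise = [any(color in test_list for color in row)
                 for row in df['colors_new']]
df['found'] = found_rowwise
print(df['found'].tolist())        # [True, False, True]
```

This is equivalent in result to the lambda above, but `any` stops at the first match in each row instead of building a full list of 0/1 values.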
Using explode:
df['found'] = df['colors_new'].explode().isin(test_list).groupby(level=0).max()
Output:
   numbers          colors      colors_new  found
0        1     [red, blue]     (red, blue)   True
1        2         [white]        (white,)  False
2        3  [blue, yellow]  (blue, yellow)   True
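Using Python sets
You can use a set and set.intersection: if the intersection is not empty, the row and test_list have at least one value in common.
Set operations are generally faster than an explicit loop.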
test_list = set(test_list)
df['found'] = df['colors_new'].apply(lambda x: len(test_list.intersection(x))>0)
Output:
   numbers          colors      colors_new  found
0        1     [red, blue]     (red, blue)   True
1        2         [white]        (white,)  False
2        3  [blue, yellow]  (blue, yellow)   True
Note: as a bonus, you can use the same approach to get the elements that were found:
df['found elements'] = df['colors_new'].apply(test_list.intersection)
Output:
   numbers          colors      colors_new  found found elements
0        1     [red, blue]     (red, blue)   True          {red}
1        2         [white]        (white,)  False             {}
2        3  [blue, yellow]  (blue, yellow)   True       {yellow}
You can also use a list comprehension:
df["colors_map"] = df[['colors','colors_new']].apply(lambda x:any([x2 in test_list for x1 in x for x2 in x1]), axis=1)
If you have many colors columns to check (not just 2):
df["colors_map"] = df[[x for x in df.columns if "colors" in x]].apply(lambda x:any([x2 in test_list for x1 in x for x2 in x1]), axis=1)
If performance matters, use a set with isdisjoint:
s = set(test_list)
df['colors_new'] = ~df.colors.map(s.isdisjoint)
Or:
s = set(test_list)
df['colors_new'] = df['colors'].map(s.intersection).astype(bool)
print (df)
   numbers          colors  colors_new
0        1     [red, blue]        True
1        2         [white]       False
2        3  [blue, yellow]        True
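One thing the isdisjoint approach does not cover is a cell that holds no list at all, since `s.isdisjoint(None)` raises a TypeError. A minimal sketch of a guarded version follows. The helper name `has_any` and the choice to treat a missing cell as "no match" are assumptions for illustration, not part of the answer above:

```python
import pandas as pd

def has_any(values, wanted):
    """Return True if `values` shares at least one item with the set `wanted`."""
    # Assumption: a cell that is not a list/tuple/set (e.g. None or NaN)
    # counts as "no match" instead of raising an error.
    if not isinstance(values, (list, tuple, set)):
        return False
    return not wanted.isdisjoint(values)

s = set(['purple', 'red', 'yellow'])
colors = pd.Series([['red', 'blue'], ['white'], None, ('blue', 'yellow')])

flags = colors.map(lambda v: has_any(v, s))
print(flags.tolist())  # [True, False, False, True]
```

Because the helper already returns the negated result, no `~` is needed on the mapped Series.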
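Performance on the test data. It is best to test on your real data, because the result depends on the length of the DataFrame, the length of the test list, and the number of matching values: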
df['colors_new'] = df.colors.map(tuple)
#3k rows
df = pd.concat([df] * 1000, ignore_index=True)
test_list = ['purple', 'red', 'yellow']
s = set(test_list)
In [46]: %timeit df['colors_new'] = ~df.colors.map(s.isdisjoint)
707 µs ± 20.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [47]: %timeit df['colors_new'] = df['colors'].map(s.intersection).astype(bool)
1.38 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [50]: %timeit df['found'] = df['colors_new'].apply(lambda x: len(s.intersection(x))>0)
1.68 ms ± 42.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [51]: %timeit df['found'] = df['colors_new'].explode().isin(test_list).groupby(level=0).max()
4.66 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df['found'] = df['colors_new'].apply(lambda x: bool(max([1 if y in test_list else 0 for y in x])))
2.91 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [54]: %timeit df["colors_map"] = df[['colors','colors_new']].apply(lambda x:any([x2 in test_list for x1 in x for x2 in x1]), axis=1)
26.1 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
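Since the numbers depend on the data, it may help to rerun the comparison outside IPython. The following is a rough sketch using the standard `timeit` module in place of `%timeit`. The `candidates` dictionary and its labels are illustrative names, and only three of the approaches above are included:

```python
import timeit
import pandas as pd

test_list = ['purple', 'red', 'yellow']
s = set(test_list)
df = pd.DataFrame({'numbers': [1, 2, 3],
                   'colors': [['red', 'blue'], ['white'], ['blue', 'yellow']]})
df = pd.concat([df] * 1000, ignore_index=True)  # 3k rows, as in the timings above

# Each candidate returns a boolean Series without modifying df
candidates = {
    'isdisjoint': lambda: ~df['colors'].map(s.isdisjoint),
    'intersection': lambda: df['colors'].map(s.intersection).astype(bool),
    'explode': lambda: df['colors'].explode().isin(test_list).groupby(level=0).max(),
}

# Check that all candidates agree before comparing their speed
results = {name: fn().tolist() for name, fn in candidates.items()}
assert results['isdisjoint'] == results['intersection'] == results['explode']

for name, fn in candidates.items():
    # Best of 3 repeats, 10 calls each, reported per call
    seconds = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f'{name}: {seconds * 1e3:.2f} ms per call')
```

Replace the construction of `df` and `test_list` with your own data to see which approach wins for your sizes. The absolute times will differ from the ones quoted above.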