How do I iterate through a Pandas DataFrame with conditions? (confusion over iterrows/for loops/vectorization)
I have a dataset that I need to iterate over based on a condition:
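import numpy as np
import pandas as pd

data = [[-10, 10, 'Hawaii', 'Honolulu'], [-22, 63], [32, -14]]
df = pd.DataFrame(data, columns=['lat', 'long', 'state', 'capital'])
# Loop over row positions; fill both fields only when both are missing
for x in range(len(df)):
    if pd.isna(df.loc[x, 'state']) and pd.isna(df.loc[x, 'capital']):
        df.loc[x, 'state'] = 'Investigate state'
        df.loc[x, 'capital'] = 'Investigate capital'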
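My expected output is that, when both the state and capital fields are empty, each is filled with its own placeholder. The actual data and the function inside the loop are more complex than this example, but I want to focus on the iterating/looping-with-conditions part.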
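My Google searches turned up iterrows, and I read tutorials that simply say to go ahead and use a for loop. Stack Overflow answers condemn both of those options and advocate vectorization. My actual dataset has about 20,000 rows. What is the most efficient approach, and how do I implement it?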
You can test each column separately and chain the masks with & (bitwise AND):
m = df['state'].isna() & df['capital'].isna()
df.loc[m, ['capital', 'state']] = ['Investigate capital','Investigate state']
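As a self-contained sketch of the mask approach above, the following builds the question's three-row frame and applies the combined mask. The variable name `both_missing` is only illustrative, and the sketch assumes the short rows are padded with missing values, which is what pandas does for this input.

```python
import pandas as pd

data = [[-10, 10, 'Hawaii', 'Honolulu'], [-22, 63], [32, -14]]
df = pd.DataFrame(data, columns=['lat', 'long', 'state', 'capital'])

# True only where BOTH columns are missing in the same row
both_missing = df['state'].isna() & df['capital'].isna()

# Assign one placeholder per listed column for the matching rows
df.loc[both_missing, ['capital', 'state']] = ['Investigate capital', 'Investigate state']

print(df)
```

Rows where only one of the two fields is missing are left untouched, because the mask requires both conditions to hold.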
The fastest option on 30k rows of sample data, with 66% of rows matching, is to set each column separately with np.where:
m = df['state'].isna() & df['capital'].isna()
df['state']= np.where(m, 'Investigate state', df['state'])
df['capital']= np.where(m, 'Investigate capital', df['capital'])
Similarly:
m = df['state'].isna() & df['capital'].isna()
df.loc[m, 'state']='Investigate state'
df.loc[m, 'capital']='Investigate capital'
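To check that the two per-column variants above produce the same values, a small sketch can run both on copies of the question's frame. The names `a`, `b`, and `same` are just illustrative.

```python
import numpy as np
import pandas as pd

data = [[-10, 10, 'Hawaii', 'Honolulu'], [-22, 63], [32, -14]]
df = pd.DataFrame(data, columns=['lat', 'long', 'state', 'capital'])
m = df['state'].isna() & df['capital'].isna()

# Variant 1: np.where builds a whole new column from the mask
a = df.copy()
a['state'] = np.where(m, 'Investigate state', a['state'])
a['capital'] = np.where(m, 'Investigate capital', a['capital'])

# Variant 2: .loc overwrites only the rows selected by the mask
b = df.copy()
b.loc[m, 'state'] = 'Investigate state'
b.loc[m, 'capital'] = 'Investigate capital'

# Compare the filled values column by column
same = (a['state'].tolist() == b['state'].tolist()
        and a['capital'].tolist() == b['capital'].tolist())
print(same)
```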
#30k rows
df = pd.concat([df] * 10000, ignore_index=True)
%%timeit
m = df['state'].isna() & df['capital'].isna()
df['state']= np.where(m, 'Investigate state', df['state'])
df['capital']= np.where(m, 'Investigate capital', df['capital'])
3.45 ms ± 39.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
m = df['state'].isna() & df['capital'].isna()
df.loc[m,'state']='Investigate state'
df.loc[m,'capital']='Investigate capital'
3.58 ms ± 11 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
m = df['state'].isna() & df['capital'].isna()
df.loc[m,['capital', 'state']] = ['Investigate capital','Investigate state']
4.5 ms ± 355 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Other solutions:
%%timeit
m=df[['state','capital']].isna().all(axis=1)
df.loc[m]=df.loc[m].fillna({'state':'Investigate state','capital':'Investigate capital'})
6.68 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
m=df[['state','capital']].isna().all(axis=1)
df.loc[m,'state']='Investigate state'
df.loc[m,'capital']='Investigate capital'
4.72 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
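The `%%timeit` cells above only run in IPython or Jupyter. As a rough sketch for a plain Python script, the standard-library `timeit` module can be used instead. The helper names (`make_frame`, `fill_where`, `fill_loc`) are hypothetical, and absolute timings will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(repeat=10000):
    """Build the sample by repeating the question's 3 rows `repeat` times."""
    data = [[-10, 10, 'Hawaii', 'Honolulu'], [-22, 63], [32, -14]]
    df = pd.DataFrame(data, columns=['lat', 'long', 'state', 'capital'])
    return pd.concat([df] * repeat, ignore_index=True)


def fill_where(df):
    """Fill both columns with np.where where both are missing."""
    m = df['state'].isna() & df['capital'].isna()
    df['state'] = np.where(m, 'Investigate state', df['state'])
    df['capital'] = np.where(m, 'Investigate capital', df['capital'])
    return df


def fill_loc(df):
    """Fill both columns with .loc where both are missing."""
    m = df['state'].isna() & df['capital'].isna()
    df.loc[m, 'state'] = 'Investigate state'
    df.loc[m, 'capital'] = 'Investigate capital'
    return df


base = make_frame()
runs = 20
for fn in (fill_where, fill_loc):
    # Work on a fresh copy each run so every call still has rows to fill
    seconds = timeit.timeit(lambda: fn(base.copy()), number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e3:.2f} ms per call (includes the copy)")
```

Copying the frame inside the timed call adds a little overhead to every variant equally, but it keeps the comparison fair: without it, the first run would fill the missing values in place and later runs would have nothing left to fill.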