数据框视图或副本的好处是什么

Question

我看到很多关于臭名昭著的 SettingWithCopy 警告的问题。我什至冒险回答了其中的一些问题。最近，我整理了一个涉及该主题的答案，我想展示数据框视图的好处。我未能提供具体的演示来说明为什么创建数据框视图或生成 SettingWithCopy

的任何东西是个好主意

考虑df

df = pd.DataFrame([[1, 2], [3, 4]], list('ab'), list('AB'))
df

   A  B
x  1  2
y  3  4

和 dfv 是 df

的副本

dfv = df[['A']]

print(dfv.is_copy)

<weakref at 0000000010916E08; to 'DataFrame' at 000000000EBF95C0>

print(bool(dfv.is_copy))

True

我可以生成 SettingWithCopy

dfv.iloc[0, 0] = 0

但是，dfv已经改变

print(dfv)

   A
a  0
b  3

df还没有

print(df)

   A  B
x  1  2
y  3  4

和dfv仍然是副本

print(bool(dfv.is_copy))

True

如果我改变df

df.iloc[0, 0] = 7
print(df)

   A  B
x  7  2
y  3  4

但是dfv没有改变。但是，我可以从 dfv

引用 df

print(dfv.is_copy())

   A  B
x  7  2
y  3  4

问题

如果 dfv 维护它自己的数据（意思是，它实际上并不节省内存）并且它通过赋值操作分配值而不顾警告，那么我们为什么要首先保存引用并生成 SettingWithCopyWarning 有吗？

什么是有形利益？

Answer 1

已有很多关于此的讨论，请参阅 here for instance, including the attempted PRs. It's also worth noting that true copy-on-write for views is being considered as part of the "pandas 2.0" refactor, see here。

在您的示例中保留引用的原因特别是因为它不是视图，所以如果有人尝试这样做，他们会收到警告。

df[['A']].iloc[0, 0] = 1

编辑：

就"why use views at all,"而言，这是出于性能/内存原因。考虑一下，基本的索引（选择列），因为这个操作取一个视图，几乎是瞬时的。

df = pd.DataFrame(np.random.randn(1000000, 2), columns=['a','b'])

%timeit df['a']
100000 loops, best of 3: 2.13 µs per loop

而获取副本的成本不菲。

%timeit df['a'].copy()
100 loops, best of 3: 4.28 ms per loop

这种性能成本会出现在许多操作中，例如将两个 Series 相加。

%timeit df['a'] + df['b']
100 loops, best of 3: 4.31 ms per loop

%timeit df['a'].copy() + df['b'].copy()
100 loops, best of 3: 13.3 ms per loop

数据框视图或副本的好处是什么

What is the benefit of a dataframe view or copy

python

pandas

chained-assignment

问题