In Python with Pandas, I have a function that changes the index of a DataFrame, but it also changes the index of the original DataFrame

Question

I have the following analysis.py file. The function group_analysis shifts the datetime index of df_input back by the number of days in df_input's Count column.

# analysis.py
import pandas as pd

def group_analysis(df_input):
    df_input.index = df_input.index - pd.to_timedelta(df_input.Count, unit = 'days')
    df_output = df_input.sort_index()

    return df_output

def test(df):
    df = df + 1
    return df

I have the following DataFrame (assuming numpy is imported as np and pandas as pd).

x = pd.DataFrame(np.arange(1,14), index = pd.date_range('2020-01-01', periods = 13, freq= 'D'), columns = ['Count'])

            Count
2020-01-01      1
2020-01-02      2
2020-01-03      3
2020-01-04      4
2020-01-05      5
2020-01-06      6
2020-01-07      7
2020-01-08      8
2020-01-09      9
2020-01-10     10
2020-01-11     11
2020-01-12     12
2020-01-13     13

When I run the code below,

import analysis
y = analysis.group_analysis(x)

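the datetime index of both x and y is changed (so x.equals(y) is True). Why does group_analysis change the datetime index of both the input and the output? How can I make it change only the datetime index of y (and not x)?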

However, when I run the code below, x does not change (so x.equals(y) is False).

import analysis
y = analysis.test(x)

Edit: added analysis.test(df).

Answer 1

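The reason for this behavior is that when you call group_analysis you are not passing a copy of the DataFrame to the function, but a reference to the original data in memory. So when you modify the data behind that reference, the original data (which is the same object) is modified too.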

For a good explanation, see https://robertheaton.com/2014/02/09/pythons-pass-by-object-reference-as-explained-by-philip-k-dick/.
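To illustrate the same idea without pandas, here is a minimal sketch using plain lists. The function names mutate and rebind are made up for this example; they only contrast changing the passed-in object with rebinding the local name to a new object.

```python
def mutate(items):
    # Changes the object the caller passed in, so the caller sees the change.
    items.append(99)
    return items


def rebind(items):
    # `items + [99]` builds a new list; only the local name points to it.
    items = items + [99]
    return items


shared = [1, 2, 3]
mutated = mutate(shared)
print(mutated is shared)   # True: same object
print(shared)              # [1, 2, 3, 99]

separate = [1, 2, 3]
rebound = rebind(separate)
print(rebound is separate)  # False: a new object was created
print(separate)             # [1, 2, 3]
```

Assigning to df_input.index inside group_analysis behaves like mutate here: it changes the object itself rather than creating a new one.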

To prevent this, create a copy of the data when entering the function:

...
def group_analysis(df):
    df_input = df.copy()
    ...
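Filled in with the code from the question, the fix might look like the sketch below. It assumes the elided parts of the snippet above are simply the original function body; the parameter name df follows the snippet.

```python
import numpy as np
import pandas as pd


def group_analysis(df):
    # Work on a copy so the caller's DataFrame keeps its original index.
    df_input = df.copy()
    df_input.index = df_input.index - pd.to_timedelta(df_input.Count, unit='days')
    return df_input.sort_index()


x = pd.DataFrame(
    np.arange(1, 14),
    index=pd.date_range('2020-01-01', periods=13, freq='D'),
    columns=['Count'],
)
y = group_analysis(x)

print(x.index[0])   # still 2020-01-01
print(x.equals(y))  # False: only y has the shifted index
```

With this sample data every row of y ends up on 2019-12-31, because each date is shifted back by exactly one day more than its offset from 2020-01-01.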

Answer 2

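When you pass a DataFrame to a function, it passes a reference to that DataFrame. So any in-place changes you make to it inside the function are reflected in the DataFrame you passed in.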

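In your test function, however, the addition returns a new copy of the DataFrame in memory. How do I know? Just print the id of the variable before and after the operation.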

>>> def test(df):
...     print(id(df))
...     df = df + 1
...     print(id(df))
...     return df
... 
>>> test(df)
139994174011920
139993943207568

Notice the change? It means the name now refers to a different object, so the original DataFrame is not affected.
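The same check can be written with the is operator instead of comparing printed ids. This is only a sketch: the helper names shift_in_place and add_one and the three-row frame are made up for illustration.

```python
import pandas as pd


def shift_in_place(frame):
    # Assigning to .index mutates the object the caller passed in.
    frame.index = frame.index + 10
    return frame


def add_one(frame):
    # `frame + 1` builds a new DataFrame; the local name is rebound to it.
    frame = frame + 1
    return frame


df = pd.DataFrame({'Count': [1, 2, 3]})

same = shift_in_place(df)
new = add_one(df)

print(same is df)      # True: the caller's object was modified
print(new is df)       # False: a different object was returned
print(list(df.index))  # [10, 11, 12]
```

Here df keeps its Count values after add_one, while its index has been changed by shift_in_place.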


Tags: python, function, pandas, datetimeindex