Comparing quarterly data: Iteration in Python (Pandas) to compare multiple columns from four different Excel files imported as dataframes

Question

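Dear Stack Overflow community, I have an Excel file "big_excel.xlsx" with four columns: "date_column", "efficacy", "composition" and "testgroups". I split this file by quarter (q1..q4) so that I can compare the values in each column against four different Excel files that I received from four different sources, which should be 100% identical. The rows in the senders' files are already sorted so that they should match the quarterly splits exactly. My code works fine for quarter q1. For the comparison I used ".equals", because the data can contain NaNs. Now I have to apply the same code concept to the remaining quarters q2..q4.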

import pandas as pd
from os.path import expanduser as ospath
import numpy as np


df = pd.read_excel(ospath('big_excel.xlsx'))

df.date_column = pd.to_datetime(df.date_column)

df['quarters'] = df.date_column.dt.quarter

q1 = df[df.quarters == 1]

q2 = df[df.quarters == 2].reset_index(drop=True)

q3 = df[df.quarters == 3].reset_index(drop=True)

q4 = df[df.quarters == 4].reset_index(drop=True)


test_excel_q1 = pd.read_excel(ospath('from_biontech.xlsx'))

test_excel_q2 = pd.read_excel(ospath('from_astrazeneca.xlsx'))

test_excel_q3 = pd.read_excel(ospath('from_sputnik.xlsx'))

test_excel_q4 = pd.read_excel(ospath('from_moderna.xlsx'))




q1['compare_date_column'] = np.where(q1[q1.columns[1]].equals(test_excel_q1[test_excel_q1.columns[1]]), 'True', 'False')  
q1['compare_efficacy'] = np.where(q1[q1.columns[2]].equals(test_excel_q1[test_excel_q1.columns[2]]), 'True', 'False')
q1['compare_composition'] = np.where(q1[q1.columns[3]].equals(test_excel_q1[test_excel_q1.columns[3]]), 'True', 'False')
q1['compare_testgroups'] = np.where(q1[q1.columns[4]].equals(test_excel_q1[test_excel_q1.columns[4]]), 'True', 'False')

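To do this, I could obviously change q1 -> q2, q3, q4 in q1['compare_date_column'], q1['compare_efficacy'], q1['compare_composition'] and q1['compare_testgroups'], then copy and paste. But that is a dirty solution, and it will get confusing if I add more columns in the future. So I am wondering whether my problem can be solved by iteration.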

My idea: create a list of variables var_list = [q1, q2, q3, q4] and, for each index i in var_list, substitute the corresponding dataframe iteratively into:

q1['compare_date_column'] = np.where(q1[q1.columns[1]].equals(test_excel_q1[test_excel_q1.columns[1]]), 'True', 'False')  
q1['compare_efficacy'] = np.where(q1[q1.columns[2]].equals(test_excel_q1[test_excel_q1.columns[2]]), 'True', 'False')
q1['compare_composition'] = np.where(q1[q1.columns[3]].equals(test_excel_q1[test_excel_q1.columns[3]]), 'True', 'False')
q1['compare_testgroups'] = np.where(q1[q1.columns[4]].equals(test_excel_q1[test_excel_q1.columns[4]]), 'True', 'False')

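Do I need to define a function for this? If so, could anyone help me, as I am still learning Python. I would greatly appreciate any input. Thank you very much for your time and effort.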

Answer 1

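One approach is to define a function that takes a quarter's dataframe and the corresponding test dataframe for that quarter, and returns the original dataframe with the comparison columns added. Something like: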

# you can also use this function to compare just one quarter
def compare_quarter(df_q: pd.DataFrame, df_test_q: pd.DataFrame):
    # this does exactly the same as your 4 comparison lines:
    # each column (positions 1..4) is compared as a whole with .equals
    df_q[[
        'compare_date_column',
        'compare_efficacy',
        'compare_composition',
        'compare_testgroups'
    ]] = \
        [np.where(df_q.iloc[:, i].equals(df_test_q.iloc[:, i]), 'True', 'False') for i in range(1, 5)]

    # df_q is modified in place and also returned for convenience
    return df_q

Then you just iterate that function over the quarters:

for q, t in zip([q1, q2, q3, q4], [test_excel_q1, test_excel_q2, test_excel_q3, test_excel_q4]):
    # the function adds the columns to q in place, so rebinding the loop variable is not needed
    compare_quarter(q, t)
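
If more columns are added later, the hard-coded list of comparison names has to be kept in sync by hand. Below is a minimal sketch of one way around that, assuming the quarterly frame and the received frame share the same column names and row order. The helper name compare_frames and the tiny in-memory frames are made up for illustration and stand in for the Excel files; the flags are real booleans rather than the 'True'/'False' strings used above.

```python
import numpy as np
import pandas as pd


def compare_frames(df_q: pd.DataFrame, df_test: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy of df_q with one whole-column 'compare_<name>' flag per column."""
    # reset_index returns a new frame, so the caller's data is left untouched
    out = df_q.reset_index(drop=True)
    test = df_test.reset_index(drop=True)
    for col in columns:
        # Series.equals is True only if every value matches;
        # NaNs in the same position count as equal
        out[f"compare_{col}"] = out[col].equals(test[col])
    return out


# Hypothetical stand-ins for two quarterly splits and the files received for them
quarters = {
    "q1": pd.DataFrame({"efficacy": [0.9, np.nan], "composition": ["a", "b"]}),
    "q2": pd.DataFrame({"efficacy": [0.7, 0.8], "composition": ["c", "d"]}),
}
received = {
    "q1": pd.DataFrame({"efficacy": [0.9, np.nan], "composition": ["a", "b"]}),
    "q2": pd.DataFrame({"efficacy": [0.7, 0.5], "composition": ["c", "d"]}),
}

columns = ["efficacy", "composition"]
results = {
    name: compare_frames(quarters[name], received[name], columns)
    for name in quarters
}
```

Keeping the quarters in a dict keyed by name means the loop does not depend on four separately named variables, and adding a column only means extending the columns list.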

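Note: I noticed that when you compare each column, you compare the quarter column and the test column as a whole. That means that if just one row differs, the entire compare column (all rows) will be False. If you want to compare element by element, use the eq method in the function, for example: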

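def compare_quartals(df_q: pd.DataFrame, df_test_q: pd.DataFrame):
    comp_cols = [
        'compare_date_column',
        'compare_efficacy',
        'compare_composition',
        'compare_testgroups'
    ]

    # compare columns 1..4 row by row; each result is a boolean Series
    # (rows are matched by index label, so both frames need the same index)
    for i in range(1, 5):
        df_q[comp_cols[i-1]] = df_q.iloc[:, i].eq(df_test_q.iloc[:, i])

    return df_q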
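
One caveat with eq: it follows the usual floating-point rule that NaN is not equal to NaN, so a row where both files have a missing value is flagged as different, which is exactly what .equals was avoiding in the question. Below is a small sketch of an element-wise comparison that treats NaNs in the same position as equal. The helper name eq_with_nan is made up for illustration, and both Series are assumed to share the same index.

```python
import numpy as np
import pandas as pd


def eq_with_nan(left: pd.Series, right: pd.Series) -> pd.Series:
    """Element-wise equality where two missing values in the same row count as equal."""
    return left.eq(right) | (left.isna() & right.isna())


left = pd.Series([1.0, np.nan, 3.0])
right = pd.Series([1.0, np.nan, 4.0])

plain = left.eq(right)                # NaN vs NaN compares as False here
nan_aware = eq_with_nan(left, right)  # NaN vs NaN compares as True here
```

Inside the loop of the function above, the eq call could be swapped for such a helper if missing values are expected in both files.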

python

iteration

loops

function

pandas