Testing a pandas DataFrame with the unittest framework

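I am trying to use the Python unittest framework to write unit tests for code that processes a CSV file. I want to test things such as whether the column names match and whether the values in a column match. I know there are more convenient libraries for this, like datatest or pytest, but I can only use unittest in my project.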

I guess I am using the wrong unittest.TestCase methods and passing the data in the wrong format. Please advise how to do this better.

Sample db.csv:

  TIMESTAMP   TYPE   VALUE YEAR  FILE   SHEET
0 02-09-2018  Index   45   2018  tq.xls A01
1 13-05-2018  Index   21   2018  tq.xls A01
2 22-01-2019  Index   9    2019  aq.xls B02

Here is the code sample:

import pandas as pd
import unittest

class DFTests(unittest.TestCase):

    def setUp(self):
        test_file_name =  'db.csv'
        try:
            data = pd.read_csv(test_file_name,
                sep = ',',
                header = 0)
        except IOError:
            print('cannot open file')
        self.fixture = data

    #Check column names
    def test_columns(self):
        self.assertEqual(
            self.fixture.columns,
            {'TIMESTAMP', 'TYPE', 'VALUE','YEAR','FILE','SHEET'},
        )

    #Check timestamp format
    def test_timestamp(self):
        self.assertRaisesRegex(
            self.fixture['TIMESTAMP'],
            r'\d{2}-\d{2}-\d{4}'
        )

    #Check year values
    def test_year_values(self):
        self.assertIn(
            self.fixture['YEAR'],
            {2018, 2019, 2020},
        )


if __name__ == '__main__':
    unittest.main()

Errors:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
TypeError: assertRaisesRegex() arg 1 must be an exception type or tuple of exception types
TypeError: 'Series' objects are mutable, thus they cannot be hashed

Any help is appreciated.

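The unittest assertions expect single values, not a whole Series, so you need to run the assertion on every row of the dataframe, for example with a list comprehension. Try something like this: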

import pandas as pd
import unittest

# Expected header. skipinitialspace=True below strips blanks after each comma,
# so the names carry no leading spaces.
colnames = ["TIMESTAMP", "TYPE", "VALUE", "YEAR", "FILE", "SHEET"]
years = {2018, 2019, 2020}


class DfTests(unittest.TestCase):
    def setUp(self):
        # Let a missing file raise here, so the tests report the real cause
        # instead of failing later on a missing self.fixture.
        self.fixture = pd.read_csv("data.csv", sep=",", skipinitialspace=True)

    def test_colnames(self):
        self.assertListEqual(list(self.fixture.columns), colnames)

    def test_timestamp_format(self):
        # You need to check every row in the dataframe, not the Series itself
        for ts in self.fixture["TIMESTAMP"]:
            self.assertRegex(ts, r"^\d{2}-\d{2}-\d{4}$")

    def test_years(self):
        df_years = self.fixture["YEAR"]
        self.assertTrue(all(i in years for i in df_years))


if __name__ == "__main__":
    unittest.main()
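
As an alternative to looping over rows, the same checks can be written with vectorized pandas operations that reduce each check to a single boolean. The following is only a sketch under stated assumptions: it loads the question's sample rows from an in-memory string so that it runs without a file on disk, and the names `CSV_TEXT`, `EXPECTED_COLUMNS`, `ALLOWED_YEARS` and `VectorizedDfTests` are made up for illustration. `Series.str.fullmatch` requires pandas 1.1 or newer.

```python
import io
import unittest

import pandas as pd

# Hypothetical in-memory copy of the sample data, so no file is needed.
CSV_TEXT = """TIMESTAMP,TYPE,VALUE,YEAR,FILE,SHEET
02-09-2018,Index,45,2018,tq.xls,A01
13-05-2018,Index,21,2018,tq.xls,A01
22-01-2019,Index,9,2019,aq.xls,B02
"""

EXPECTED_COLUMNS = ["TIMESTAMP", "TYPE", "VALUE", "YEAR", "FILE", "SHEET"]
ALLOWED_YEARS = {2018, 2019, 2020}


class VectorizedDfTests(unittest.TestCase):
    def setUp(self):
        # Read TIMESTAMP as text so its format can be checked as written.
        self.df = pd.read_csv(io.StringIO(CSV_TEXT), dtype={"TIMESTAMP": str})

    def test_columns(self):
        self.assertListEqual(list(self.df.columns), EXPECTED_COLUMNS)

    def test_timestamp_format(self):
        # One boolean per row; na=False treats missing values as mismatches.
        ok = self.df["TIMESTAMP"].str.fullmatch(r"\d{2}-\d{2}-\d{4}", na=False)
        self.assertTrue(ok.all(), f"bad rows: {list(self.df.index[~ok])}")

    def test_years(self):
        # isin() yields one boolean per row; all() collapses them to one value.
        ok = self.df["YEAR"].isin(ALLOWED_YEARS)
        self.assertTrue(ok.all(), f"bad rows: {list(self.df.index[~ok])}")
```

Saved as a test module, this can be run with `python -m unittest`. Collapsing to a single boolean with `.all()` is what avoids the "truth value of an array is ambiguous" error from the question, and the failure message lists the offending row labels.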

Also, keep in mind that pandas has some built-in testing functions. On the other hand, for unit-testing dataframes (and data validation in general), great_expectations may be the best tool for the job.
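
The built-in helpers live in the `pandas.testing` module (`assert_frame_equal`, `assert_series_equal`, `assert_index_equal`). They raise `AssertionError` on a mismatch, so they can be called inside a `unittest.TestCase` without any extra library. A minimal sketch, assuming a small hand-built frame modelled on the question's sample; the `load_data` helper and the class name are hypothetical stand-ins:

```python
import unittest

import pandas as pd
import pandas.testing as pdt

COLUMNS = ["TIMESTAMP", "TYPE", "VALUE", "YEAR", "FILE", "SHEET"]


def load_data():
    # Hypothetical stand-in for pd.read_csv("db.csv"); values from the sample.
    return pd.DataFrame(
        {
            "TIMESTAMP": ["02-09-2018", "13-05-2018", "22-01-2019"],
            "TYPE": ["Index", "Index", "Index"],
            "VALUE": [45, 21, 9],
            "YEAR": [2018, 2018, 2019],
            "FILE": ["tq.xls", "tq.xls", "aq.xls"],
            "SHEET": ["A01", "A01", "B02"],
        }
    )


class PandasTestingExamples(unittest.TestCase):
    def setUp(self):
        self.df = load_data()

    def test_column_index(self):
        # Compares names and order of the columns in one call.
        pdt.assert_index_equal(self.df.columns, pd.Index(COLUMNS))

    def test_year_column(self):
        # Compares values, dtype, index and name of a single column.
        expected = pd.Series([2018, 2018, 2019], name="YEAR")
        pdt.assert_series_equal(self.df["YEAR"], expected)

    def test_whole_frame(self):
        # Compares the full frame against an expected frame.
        pdt.assert_frame_equal(self.df, load_data())
```

These helpers suit cases where the full expected result is known in advance. For rule-style checks such as "every year is in an allowed set", the per-row or vectorized assertions are usually a better fit.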