Python: iterate through the rows of a csv and calculate date difference if there is a change in a column

I only have basic Python knowledge, so I'm not even sure whether this is possible.

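I have a csv that looks like this: [1]: https://i.stack.imgur.com/8clYM.png (This is dummy data; the real data has about 30K rows.) I need to find the latest job title for each employee (unique ID) and then calculate how long (= how many days) the employee has held that same job title.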

What I have done so far:

import csv
import datetime
from datetime import *

data = open(r"C:\Users\User\PycharmProjects\pythonProject\jts.csv", encoding="utf-8")
csv_data  = csv.reader(data)
data_lines = list(csv_data)
print(data_lines)

for i in data_lines:
    for j in i[0]: 

But then I got nowhere, because I can't even conceptualise how to structure this. :-( I also know that at some point I will need:

datetime.strptime(data_lines[1][2] , '%Y/%M/%d').date()

Can anyone help? I just need a new list along the lines of: id, job title, days, e.g. 500 plumber 370

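Edit for clarification: the dates are the points at which the data was fetched. I need to count back from the most recent one until the job title is something else. So, in my example, employee 5000 would run from 04/07/2021 back to 01/03/2020.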

But then I haven't got anywhere because I can't even conceptualise how to structure this. :-(

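Have a map (dict) from employee to (date, job title).

For each row, check whether you already have an entry for that employee. If you don't, just put the information in the map; otherwise compare the row's date with the entry's date. If the row's date is more recent, replace the entry.

Once you have gone through all the rows, you can go through the map you collected and compute the difference between the end date and "today".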
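The steps above could be sketched roughly as follows. This is only a minimal illustration: the inline `SAMPLE` text, the helper name `latest_entries`, and the day/month/year reading of the dates are assumptions made for the example, not part of the original data.

```python
import csv
import io
from datetime import date, datetime

# Hypothetical sample rows standing in for the real file (day/month/year assumed).
SAMPLE = """id,jtitle,date
5000,plumber,01/01/2020
5000,senior plumber,02/03/2020
6000,software engineer,01/02/2020
"""


def latest_entries(rows):
    """Map each employee id to its most recent (date, job title) pair."""
    latest = {}
    for emp_id, jtitle, raw_date in rows:
        when = datetime.strptime(raw_date, "%d/%m/%Y").date()
        # Store the row if the employee is new or this row is more recent.
        if emp_id not in latest or when > latest[emp_id][0]:
            latest[emp_id] = (when, jtitle)
    return latest


reader = csv.reader(io.StringIO(SAMPLE))
next(reader)  # skip the header row
latest = latest_entries(reader)

# Difference between each employee's most recent date and "today".
today = date.today()
for emp_id, (when, jtitle) in latest.items():
    print(emp_id, jtitle, (today - when).days)
```

Note that this only tracks the most recent row per employee, as described above; it does not by itself find the date on which the current job title began.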

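By the way, your pattern is incorrect: the sample data uses either %d/%m/%Y (day/month/year) or %m/%d/%Y (month/day/year). The sample isn't sufficient to say which, but it is definitely not YMD.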
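To make the difference concrete, here is a small sketch that parses one date string from the question's edit with both candidate patterns. Which reading is right depends on the real data, so both are shown. Note also that `%M` is the minutes directive in `strptime`; the month directive is lowercase `%m`.

```python
from datetime import datetime

raw = "04/07/2021"  # a date string taken from the question's edit

# Day-first reading: 4 July 2021
day_first = datetime.strptime(raw, "%d/%m/%Y").date()
# Month-first reading: 7 April 2021
month_first = datetime.strptime(raw, "%m/%d/%Y").date()

print(day_first, month_first)

# A year-first pattern does not match this string and raises ValueError.
try:
    datetime.strptime(raw, "%Y/%m/%d")
except ValueError as exc:
    print("year-first pattern rejected:", exc)
```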

Let's consider sample data such as the following:

id,jtitle,date
5000,plumber,01/01/2020
5000,senior plumber,02/03/2020
6000,software engineer,01/02/2020
6000,software architecture,06/02/2021
7000,software tester,06/02/2019

The following code works.

import pandas as pd

# load data
data = pd.read_csv('data.csv')

# convert to datetime object (dates are day/month/year)
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
print(data)

# sort newest first and number each employee's rows: 0 = latest, 1 = previous
data = data.sort_values('date', ascending=False)
rank = data.groupby('id').cumcount()

# latest row per employee, indexed by id
latest = data[rank == 0].set_index('id').sort_index()
print(latest)

# date of the previous row per employee; this is the last change in job title
# only if every row records a new title, as in the sample data
prev_date = data[rank == 1].set_index('id')['date']
print(prev_date)

# calculate the difference in days (NaT if there is no earlier row)
latest['days'] = latest['date'] - prev_date
print(latest)

Output:

                     jtitle       date     days
id                                             
5000         senior plumber 2020-03-02  61 days
6000  software architecture 2021-02-06 371 days
7000        software tester 2019-02-06      NaT
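Taking the second most recent row per employee matches the last change of job title only when consecutive rows always have different titles. If the same title can appear on several rows, as in the question's clarification (count back from the most recent row until the title is something else), a variation along the following lines might help. This is a hedged sketch: the inline `SAMPLE` data and the helper name `days_in_latest_title` are made up for the illustration, and the dates are assumed to be day/month/year.

```python
import io
import pandas as pd

# Hypothetical sample with a repeated job title (day/month/year assumed).
SAMPLE = """id,jtitle,date
5000,plumber,01/01/2020
5000,head plumber,01/05/2020
5000,head plumber,04/07/2021
6000,electrician,01/02/2018
"""

data = pd.read_csv(io.StringIO(SAMPLE))
data["date"] = pd.to_datetime(data["date"], format="%d/%m/%Y")


def days_in_latest_title(group):
    """Days from the most recent row with a different title to the newest row."""
    group = group.sort_values("date", ascending=False)
    newest = group.iloc[0]
    different = group[group["jtitle"] != newest["jtitle"]]
    # With no different title on record, fall back to the oldest row.
    if len(different):
        start = different["date"].iloc[0]
    else:
        start = group["date"].iloc[-1]
    return (newest["date"] - start).days


days = {emp_id: days_in_latest_title(group) for emp_id, group in data.groupby("id")}
print(days)
```

Measuring from the last row that carries a different title follows the question's own example; measuring from the first row with the current title instead would be an equally defensible choice.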

Looks like I'm late to the party... but in case you're interested, here is a suggestion in pure Python (nothing wrong with Pandas, though!):

import csv
import datetime as dt
from operator import itemgetter
from itertools import groupby

# csv.reader needs an open file object, not the file name
with open('data.csv', newline='') as file:
    reader = csv.reader(file)
    next(reader)  # Discard header row
    # Read, transform (date), and sort in reverse (id first, then date):
    data = sorted(((i, jtitle, dt.datetime.strptime(date, '%d/%m/%Y'))
                   for i, jtitle, date in reader),
                  key=itemgetter(0, 2), reverse=True)

# Process data grouped by id
result = []
for i, group in groupby(data, key=itemgetter(0)):
    _, jtitle, end = next(group)  # Fetch the latest job title and its date

    # Search for the first occurrence of a different job title:
    start = end
    for _, jt, start in group:
        if jt != jtitle:
            break

    # Collect results in list with datetimes transformed back
    result.append((i, jtitle, end.strftime('%d/%m/%Y'), (end - start).days))

result = sorted(result, key=itemgetter(0))

Result for the input data

id,jtitle,date
5000,plumber,01/01/2020
5000,plumber,01/02/2020
5000,senior plumber,01/03/2020
5000,head plumber,01/05/2020
5000,head plumber,02/09/2020
5000,head plumber,05/01/2021
5000,head plumber,04/07/2021
6000,electrician,01/02/2018
6000,qualified electrician,01/06/2020
7000,plumber,01/01/2004
7000,plumber,09/11/2020
7000,senior plumber,05/06/2021

[('5000', 'head plumber', '04/07/2021', 490),
 ('6000', 'qualified electrician', '01/06/2020', 851),
 ('7000', 'senior plumber', '05/06/2021', 208)]