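Python: iterate through the rows of a csv and calculate date difference if there is a change in a column
I only have basic knowledge of Python, so I'm not even sure whether this is possible.
I have a csv that looks like this:
[1]: https://i.stack.imgur.com/8clYM.png
(This is dummy data; the real data has about 30K rows.)
I need to find the latest job title for each employee (unique ID) and then calculate how long (in days) the employee has held that same title.
What I have done so far: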
import csv
from datetime import datetime

# Raw string so the backslashes in the Windows path are not treated as escapes
data = open(r"C:\Users\User\PycharmProjects\pythonProject\jts.csv", encoding="utf-8")
csv_data = csv.reader(data)
data_lines = list(csv_data)
print(data_lines)
for i in data_lines:
    for j in i[0]:
        pass  # the attempt stops here
But then I haven't got anywhere, because I can't even conceptualise how to structure this. :-(
I also know that at some point I will need:
datetime.strptime(data_lines[1][2] , '%Y/%M/%d').date()
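Can anyone help? All I need is a new list that looks like this:
ID    job title    days
500   plumber      370
Edit for clarification: the dates are the points at which the data was captured. I need to count back from the most recent one until the job title is something else. So in my example, employee 5000 runs from 04/07/2021 back to 01/03/2020.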
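Keep a map (dict) from employee to (date, job title).
For each row, check whether you already have an entry for that employee. If you don't, just put the information in the map; otherwise compare the row's date with the entry's date. If the row's date is more recent, replace the entry.
Once you have gone through all the rows, you can go through the map you collected and calculate the difference between the end date and "today".
By the way, your pattern is incorrect: the sample data uses either %d/%m/%Y (day/month/year) or %m/%d/%Y (month/day/year). The sample is not sufficient to tell which, but it is definitely not YMD.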
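The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the column layout `id,jtitle,date`, the day/month/year format, the inline `SAMPLE` data and the function name `latest_positions` are all made up for the example, and `today` is passed in explicitly so the result is reproducible. Note that it measures from the most recent row to "today", as described in this answer, and does not look at how long the title has been unchanged.

```python
import csv
import io
from datetime import date, datetime

# Hypothetical sample in an assumed id,jtitle,date layout (day/month/year assumed).
SAMPLE = """id,jtitle,date
5000,plumber,01/01/2020
5000,senior plumber,02/03/2020
6000,software engineer,01/02/2020
"""


def latest_positions(lines, today):
    """Map each employee ID to the (date, job title) of its most recent row,
    then return (id, job title, days between that date and `today`)."""
    latest = {}
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    for emp_id, jtitle, raw_date in reader:
        # %m is the month; %M would mean minutes.
        row_date = datetime.strptime(raw_date, "%d/%m/%Y").date()
        # No entry yet, or this row is more recent: store or replace it.
        if emp_id not in latest or row_date > latest[emp_id][0]:
            latest[emp_id] = (row_date, jtitle)
    return [(emp_id, jtitle, (today - d).days)
            for emp_id, (d, jtitle) in latest.items()]


result = latest_positions(io.StringIO(SAMPLE), today=date(2021, 7, 4))
print(result)
```

For a real file, an opened file object (for example `open(path, newline="", encoding="utf-8")`) can be passed instead of the `io.StringIO` wrapper, and `date.today()` instead of the fixed date.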
Let's consider sample data like the following:
id,jtitle,date
5000,plumber,01/01/2020
5000,senior plumber,02/03/2020
6000,software engineer,01/02/2020
6000,software architecture,06/02/2021
7000,software tester,06/02/2019
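The following code works.
import pandas as pd

# load data
data = pd.read_csv('data.csv')
# convert to datetime object (the sample dates are day/month/year)
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
print(data)
# sort newest first and keep the most recent row per employee ID
newest_first = data.sort_values('date', ascending=False)
latest = newest_first.drop_duplicates('id').set_index('id').sort_index()
print(latest)
# date of the second most recent row per employee (absent if there is only one row);
# in this sample every row has a different title, so this is where the title changed
older = newest_first[newest_first.duplicated('id')]
prev_date = older.drop_duplicates('id').set_index('id')['date']
print(prev_date)
# calculate the difference in days (aligned on id; NaT where there is no earlier row)
latest['days'] = latest['date'] - prev_date
print(latest)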
Output:
                     jtitle       date     days
id
5000         senior plumber 2020-03-02  61 days
6000  software architecture 2021-02-06 371 days
7000        software tester 2019-02-06      NaT
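The pandas code above takes the second most recent row per employee, which only marks the title change when no title is repeated. If the same title can appear on several rows, as in the question's clarification, one possible variation is to walk back from the newest row until the title differs. The sketch below is an assumption-laden illustration, not part of the original answer: the function name `days_in_latest_title` and the inline `SAMPLE` data are hypothetical, and it counts from the last row carrying the previous title (falling back to the employee's first row, giving 0 days for a single row).

```python
import io

import pandas as pd

# Hypothetical sample with a repeated title (day/month/year assumed).
SAMPLE = """id,jtitle,date
5000,plumber,01/01/2020
5000,senior plumber,01/03/2020
5000,head plumber,01/05/2020
5000,head plumber,04/07/2021
7000,software tester,06/02/2019
"""


def days_in_latest_title(df):
    """For each id, return the latest title and the days between the newest
    row and the last row that carried a different title."""
    rows = []
    for emp_id, grp in df.sort_values('date').groupby('id'):
        titles = grp['jtitle'].tolist()
        dates = grp['date'].tolist()
        # Walk backwards from the newest row while the title stays the same.
        k = len(titles) - 1
        while k > 0 and titles[k - 1] == titles[-1]:
            k -= 1
        # k is the first row of the current title; k - 1 is the previous title's last row.
        start = dates[k - 1] if k > 0 else dates[0]
        rows.append((emp_id, titles[-1], (dates[-1] - start).days))
    return pd.DataFrame(rows, columns=['id', 'jtitle', 'days'])


df = pd.read_csv(io.StringIO(SAMPLE))
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
out = days_in_latest_title(df)
print(out)
```

The explicit Python loop per group keeps the logic easy to follow; for 30K rows that should be acceptable, though a fully vectorised version is possible.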
Looks like I'm late... Still, if you're interested, here is a suggestion in pure Python (although Pandas is fine too!):
import csv
import datetime as dt
from operator import itemgetter
from itertools import groupby

# csv.reader needs an open file (or other iterable of lines), not a file name
with open('data.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)  # Discard header row
    # Read, transform (date), and sort in reverse (id first, then date):
    data = sorted(((i, jtitle, dt.datetime.strptime(date, '%d/%m/%Y'))
                   for i, jtitle, date in reader),
                  key=itemgetter(0, 2), reverse=True)

# Process data grouped by id
result = []
for i, group in groupby(data, key=itemgetter(0)):
    _, jtitle, end = next(group)  # Fetch latest job title and its date
    # Search for first occurrence of a different job title:
    start = end
    for _, jt, start in group:
        if jt != jtitle:
            break
    # Collect results in list with datetimes transformed back
    result.append((i, jtitle, end.strftime('%d/%m/%Y'), (end - start).days))
result = sorted(result, key=itemgetter(0))
The result for the input data
id,jtitle,date
5000,plumber,01/01/2020
5000,plumber,01/02/2020
5000,senior plumber,01/03/2020
5000,head plumber,01/05/2020
5000,head plumber,02/09/2020
5000,head plumber,05/01/2021
5000,head plumber,04/07/2021
6000,electrician,01/02/2018
6000,qualified electrician,01/06/2020
7000,plumber,01/01/2004
7000,plumber,09/11/2020
7000,senior plumber,05/06/2021
is
[('5000', 'head plumber', '04/07/2021', 490),
('6000', 'qualified electrician', '01/06/2020', 851),
('7000', 'senior plumber', '05/06/2021', 208)]