Derive date column based on different types of values
I have a data frame as shown below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'subject_id': [1, 2, 3, 4, 5],
                   'date_of_interview': ['2007-05-27', '2008-03-13', '2010-11-19', '2011-10-05', '2004-11-02'],
                   'Age': [31, 35, 78, 72, 43],
                   'value': [6, 0.33, 1990, np.nan, 2001],
                   'age_detected': [25, 35, 98, 65, 40]})
df['date_of_interview'] = pd.to_datetime(df['date_of_interview'])
I would like to create a new column called dis_date based on the value and age_detected columns.
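For example, subject_id = 1 has a date_of_interview of 2007-05-27. His value column contains 6, which means we have to subtract 6 years from date_of_interview, giving 2001-05-27 as dis_date.
For subject_id = 3, on the other hand, the value column contains a year, so his dis_date will be 1990-11-19.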
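When the value column is NA, we have to look at the age_detected column and subtract it from Age to get the number of years.
For example, subject_id = 4 has an Age of 72 and an age_detected of 65. The difference is 7, so his dis_date will be 2004-10-05.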
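Note that values in the value column with fewer than 6 digits represent a number of years. A value of 1 means subtract 1 year; a value of 0.33 means subtract 4 months (1 year = 12 months, so 0.33 years = 3.96 months, rounded to 4 months).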
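The arithmetic above can be illustrated with a small sketch. It assumes that a fractional number of years is turned into whole months by rounding, as in the 0.33 example; the variable names are only for illustration.

```python
import pandas as pd

# A whole number of years is subtracted directly with the plural `years=` keyword.
six_years_earlier = pd.Timestamp("2007-05-27") - pd.DateOffset(years=6)
print(six_years_earlier)      # 2001-05-27 00:00:00

# A fraction of a year is converted to months first: 0.33 * 12 = 3.96, rounded to 4.
months = int(round(0.33 * 12))
four_months_earlier = pd.Timestamp("2008-03-13") - pd.DateOffset(months=months)
print(months)                 # 4
print(four_months_earlier)    # 2007-11-13 00:00:00
```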
I tried something like the following, but it did not work:
for i in range(len(df['value'])):
    if (len(str(df['value'][i]))) < 6:
        df['dis_date'] = df['date_of_interview'] - pd.DateOffset(years=df['value'][i])
I expect my output to look like the one shown below.
In this solution, helper columns are created so that the replaced years and the subtracted months can be verified:
# values below 1 are fractions of a year: convert them to months, set all other rows to NaN
df['m1'] = np.where(df['value'].lt(1), df['value'].mul(12).round(), np.nan)
# values above 1000 are calendar years
df['y1'] = df['value'].where(df['value'].gt(1000))
# where Age - age_detected is between 1 and 1000, subtract that difference from the interview year
y1 = df['Age'].sub(df['age_detected'])
df['y2'] = np.where(y1.between(1, 1000), df['date_of_interview'].dt.year.sub(y1), np.nan)
# combine both year sources into one column
df['y'] = df['y1'].fillna(df['y2'])
# replace the year of date_of_interview (the singular year= keyword sets the year instead of shifting it)
f1 = lambda x: x['date_of_interview'] - pd.DateOffset(year=(int(x['y'])))
df['dis_date1'] = df.dropna(subset=['date_of_interview','y']).apply(f1, axis=1)
# subtract months where m1 is not missing
f2 = lambda x: x['date_of_interview'] - pd.DateOffset(months=(int(x['m1'])))
df['dis_date2'] = df.dropna(subset=['m1']).apply(f2, axis=1)
# combine both results
df['dis_date'] = df['dis_date1'].fillna(df['dis_date2'])
print(df)
   subject_id date_of_interview  Age    value  age_detected   m1      y1  \
0           1        2007-05-27   31     6.00            25  NaN     NaN
1           2        2008-03-13   35     0.33            35  4.0     NaN
2           3        2010-11-19   78  1990.00            98  NaN  1990.0
3           4        2011-10-05   72      NaN            65  NaN     NaN
4           5        2004-11-02   43  2001.00            40  NaN  2001.0

       y2       y  dis_date1  dis_date2   dis_date
0  2001.0  2001.0 2001-05-27        NaT 2001-05-27
1     NaN     NaN        NaT 2007-11-13 2007-11-13
2     NaN  1990.0 1990-11-19        NaT 1990-11-19
3  2004.0  2004.0 2004-10-05        NaT 2004-10-05
4  2001.0  2001.0 2001-11-02        NaT 2001-11-02
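For comparison, the rules from the question can also be applied row by row without helper columns. This is only a sketch under stated assumptions, not the answer's method: derive_dis_date is a hypothetical helper, the 1000 cut-off for "this is a calendar year" is borrowed from the code above, and every other non-missing value is treated as a duration in years that is converted to whole months.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 2, 3, 4, 5],
    'date_of_interview': pd.to_datetime(
        ['2007-05-27', '2008-03-13', '2010-11-19', '2011-10-05', '2004-11-02']),
    'Age': [31, 35, 78, 72, 43],
    'value': [6, 0.33, 1990, np.nan, 2001],
    'age_detected': [25, 35, 98, 65, 40],
})

def derive_dis_date(row):
    """Hypothetical helper: apply the question's rules to a single row."""
    date, value = row['date_of_interview'], row['value']
    if pd.isna(value):
        # Missing value: subtract Age - age_detected years.
        return date - pd.DateOffset(years=int(row['Age'] - row['age_detected']))
    if value > 1000:
        # Assumed threshold: anything above 1000 is a calendar year.
        # Note: Timestamp.replace raises for Feb 29 in a non-leap target year.
        return date.replace(year=int(value))
    # Otherwise the value is a duration in years; convert it to whole months.
    return date - pd.DateOffset(months=int(round(value * 12)))

df['dis_date'] = df.apply(derive_dis_date, axis=1)
print(df[['subject_id', 'date_of_interview', 'dis_date']])
```

On the sample data this should give the same dis_date values as the helper-column approach. The row-wise apply is easier to read but slower on large frames, so the vectorised version above may be preferable there.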