Faster function or script for computing a large dataframe
I have online user data with the following information:
df.head()
USER  Timestamp            day_of_week  Busi_days  Busi_hours
AAD   2017-07-11 09:31:44               TRUE       TRUE
AAD   2017-07-11 23:24:43               TRUE       FALSE
AAD   2017-07-12 13:24:43               TRUE       TRUE
SAP   2017-07-23 14:24:34               FALSE      FALSE
SAP   2017-07-24 16:58:49               TRUE       TRUE
YAS   2017-07-31 21:10:35               TRUE       FALSE
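I want to compute the activity per USER and create three new columns:
1. Activity: based on how active the user is. If the same user clicks more than twice, the column is TRUE, otherwise FALSE.
2. Multiple_days: whether the user clicks the website on more than one day. If the same user clicks on more than 2 days, the column is TRUE, otherwise FALSE.
3. Business_days: whether the user clicks on weekdays. If the user clicks the website on a weekday within business hours, the column is TRUE, otherwise FALSE.
I have the following script that performs these tasks, but it is really slow on my huge data frame, which is 117 MB in size.
Any better solution would be great.
My attempt: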
df.Timestamp = pd.to_datetime(df.Timestamp)
df['date'] = [x.date() for x in df.Timestamp]
target_df = pd.DataFrame()
target_df['USER'] = df.USER.unique()
a = df.groupby(['USER', 'date']).size()
a = a[a>1]
UID=pd.DataFrame(a).reset_index().USER.values
target_df['Active'] = [True if x in UID else False for x in target_df.USER.values]
a = df.groupby('USER')['Timestamp'].nunique()
a = a[a>1]
UUID2=pd.DataFrame(a).reset_index().USER.values
target_df['Multiple_days'] = [True if x in UUID2 else False for x in target_df.USER.values]
a = df[(df.Busi_days==True)&(df.Busi_hours==True)].USER.unique()
target_df['Busi_weekday'] = [True if x in a else False for x in target_df.USER.values]
target_df.head()
USER  Active  Multiple_days  Busi_weekday
AAD   TRUE    TRUE           TRUE
SAP   FALSE   TRUE           FALSE
YAS   FALSE   FALSE          FALSE
You can use:
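import pandas as pd

df.Timestamp = pd.to_datetime(df.Timestamp)
# floor to midnight in one vectorized call instead of a per-row .date() loop
df['date'] = df.Timestamp.dt.floor('D')
u = df.USER.unique()

# Active: more than one click by the same user on the same date
a = df.groupby(['USER', 'date']).size().reset_index(level=1, drop=True)
a = a[a>1]
# parentheses are required for the expression to continue onto the next line
target_df = (a[~a.index.duplicated()]
              .astype(bool).reindex(u, fill_value=False).to_frame(name='Active'))

# Multiple_days: more than one distinct Timestamp per user, as in the question's code
a = df.groupby('USER')['Timestamp'].nunique()
target_df['Multiple_days'] = a[a>1].astype(bool).reindex(u, fill_value=False)

# Busi_weekday: at least one click on a business day within business hours
a = df[(df.Busi_days==True)&(df.Busi_hours==True)].USER.unique()
target_df['Busi_weekday'] = target_df.index.isin(a)
print(target_df)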
      Active  Multiple_days  Busi_weekday
USER
AAD     True           True          True
SAP    False           True          True
YAS    False          False         False
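
Much of the slowdown in the question's script probably comes from the `x in UID` list comprehensions, since each test scans a NumPy array once per user. A minimal sketch of the difference, using made-up stand-ins (`users`, `flagged`) rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: every user, and the subset that passed a filter.
users = pd.Series(["AAD", "SAP", "YAS"], name="USER")
flagged = np.array(["AAD", "SAP"])

# Pattern from the question: one linear scan of `flagged` per user.
slow = [x in flagged for x in users.values]

# Vectorized equivalent: a single membership test over the whole column.
fast = users.isin(flagged).tolist()

print(slow == fast)
```

Both give the same flags; the second avoids the Python-level loop, which is what matters on a large frame.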
EDIT:
Solution with a custom function:
print (df1)
  USER            Timestamp day_of_week  Busi_days  Busi_hours
0  AAD  2017-07-11 09:31:44                   True        True
1  AAD  2017-07-11 23:24:43                   True       False
2  AAD  2017-07-12 13:24:43                   True        True
3  SAP  2017-07-23 14:24:34                  False       False
4  SAP  2017-07-24 16:58:49                   True        True
5  YAS  2017-07-31 21:10:35                   True       False
def func(df, time_col, user_col):
    # note: adds/overwrites columns on the DataFrame that is passed in
    df[time_col] = pd.to_datetime(df[time_col])
    df['date'] = df[time_col].dt.floor('D')
    # use the user_col argument rather than the hard-coded USER column
    u = df[user_col].unique()

    a = df.groupby([user_col, 'date']).size().reset_index(level=1, drop=True)
    a = a[a>1]
    target_df = (a[~a.index.duplicated()]
                  .astype(bool).reindex(u, fill_value=False).to_frame(name='Active'))

    a = df.groupby(user_col)[time_col].nunique()
    target_df['Multiple_days'] = a[a>1].astype(bool).reindex(u, fill_value=False)

    a = df.loc[(df.Busi_days==True)&(df.Busi_hours==True), user_col].unique()
    target_df['Busi_weekday'] = target_df.index.isin(a)
    return target_df

# inputs are the DataFrame, the timestamp column name and the user column name
print (func(df1, 'Timestamp', 'USER'))
      Active  Multiple_days  Busi_weekday
USER
AAD     True           True          True
SAP    False           True          True
YAS    False          False         False
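
As an alternative, the three flags could probably be derived from a single per-user aggregation. The sketch below is only one possible variant, not the method above: it rebuilds the sample rows from the question, assumes pandas 0.25 or later for named aggregation, and counts distinct dates for Multiple_days (one reading of the rule, whereas the code above counts distinct timestamps). The helper columns `busi` and `per_day` are names introduced here for illustration.

```python
import pandas as pd

# Sample rows copied from the question (day_of_week left out).
df = pd.DataFrame({
    "USER": ["AAD", "AAD", "AAD", "SAP", "SAP", "YAS"],
    "Timestamp": pd.to_datetime([
        "2017-07-11 09:31:44", "2017-07-11 23:24:43", "2017-07-12 13:24:43",
        "2017-07-23 14:24:34", "2017-07-24 16:58:49", "2017-07-31 21:10:35",
    ]),
    "Busi_days": [True, True, True, False, True, True],
    "Busi_hours": [True, False, True, False, True, False],
})

df["date"] = df["Timestamp"].dt.floor("D")
# Row-level flag: click on a business day and within business hours.
df["busi"] = df["Busi_days"] & df["Busi_hours"]
# For each row, how many clicks its user made on that same date.
df["per_day"] = df.groupby(["USER", "date"])["Timestamp"].transform("size")

# One pass over the users with built-in reductions (no Python lambdas).
g = df.groupby("USER", sort=False).agg(
    max_per_day=("per_day", "max"),
    n_days=("date", "nunique"),      # assumption: distinct dates, not timestamps
    Busi_weekday=("busi", "any"),
)

target_df = pd.DataFrame({
    "Active": g["max_per_day"] > 1,
    "Multiple_days": g["n_days"] > 1,
    "Busi_weekday": g["Busi_weekday"],
})
print(target_df)
```

On the six sample rows this gives the same flags as the output above; whether it is actually faster on the 117 MB frame would need to be timed.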