Creating fake data using date logic
I'm trying to insert fake data into this table. It can't be completely random, because the rows need to make sense relative to each other. I'll explain below.
My data looks like this:
AcctID | account_status | start_date | end_date |
---|---|---|---|
C382861922 | ACTIVE | 2016-05-25 | None |
C382861922 | INACTIVE | None | None |
C382861922 | ACTIVE | None | None |
C382861922 | INACTIVE | None | 2021-12-31 |
C429768513 | ACTIVE | 2015-12-27 | None |
C429768513 | INACTIVE | None | None |
C429768513 | ACTIVE | None | None |
C429768513 | INACTIVE | None | None |
C429768513 | ACTIVE | None | None |
C429768513 | INACTIVE | None | None |
C429768513 | ACTIVE | None | None |
C429768513 | INACTIVE | None | 2021-12-31 |
C643625629 | ACTIVE | 2016-07-24 | None |
C643625629 | INACTIVE | None | None |
C643625629 | ACTIVE | None | 2021-12-31 |
C82157435 | ACTIVE | 2016-10-22 | None |
C82157435 | INACTIVE | None | 2021-12-31 |
Each AcctID can appear many times, but it's easiest to explain what I'm doing with an example where an AcctID appears only twice:
AcctID | account_status | start_date | end_date |
---|---|---|---|
C82157435 | ACTIVE | 2016-10-22 | None |
C82157435 | INACTIVE | None | 2021-12-31 |
My goal is to randomly pick a date on which this customer changed their account_status; that date becomes the end_date of the first row and the start_date of the second row. So I only need to pick 1 random date and insert it in two places. Easy enough - I can take max() and min(), compute the difference in days, and pick a random integer within that range.
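For the two-row case, a minimal sketch of that idea (using illustrative start/end values, not the exact code from the question) might look like:

import random
from datetime import date, timedelta

start = date(2016, 10, 22)   # the known start_date (min)
end = date(2021, 12, 31)     # the known end_date (max)
span_days = (end - start).days
# pick a random day strictly between the two known endpoints
switch = start + timedelta(days=random.randint(1, span_days - 1))
# 'switch' becomes the end_date of row 1 and the start_date of row 2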
However, I don't know how to do this for customers with more than 2 records:
AcctID | account_status | start_date | end_date |
---|---|---|---|
C429768513 | ACTIVE | 2015-12-27 | None |
C429768513 | INACTIVE | None | None |
C429768513 | ACTIVE | None | None |
C429768513 | INACTIVE | None | None |
C429768513 | ACTIVE | None | None |
C429768513 | INACTIVE | None | None |
C429768513 | ACTIVE | None | None |
C429768513 | INACTIVE | None | 2021-12-31 |
There are several places where a random date has to be chosen, but since they all need to correspond to one another, the problem gets complicated. Any ideas?
Here is the code to create the sample dataframe:
import pandas as pd

fake = [
    {
        "AcctID": "C429768513",
        "account_status": "ACTIVE",
        "start_date": "2015-12-27",
        "end_date": "None"
    },
    {
        "AcctID": "C429768513",
        "account_status": "INACTIVE",
        "start_date": "None",
        "end_date": "None"
    },
    {
        "AcctID": "C429768513",
        "account_status": "ACTIVE",
        "start_date": "None",
        "end_date": "None"
    },
    {
        "AcctID": "C429768513",
        "account_status": "INACTIVE",
        "start_date": "None",
        "end_date": "None"
    },
    {
        "AcctID": "C429768513",
        "account_status": "ACTIVE",
        "start_date": "None",
        "end_date": "None"
    },
    {
        "AcctID": "C429768513",
        "account_status": "INACTIVE",
        "start_date": "None",
        "end_date": "None"
    },
    {
        "AcctID": "C429768513",
        "account_status": "ACTIVE",
        "start_date": "None",
        "end_date": "None"
    },
    {
        "AcctID": "C429768513",
        "account_status": "INACTIVE",
        "start_date": "None",
        "end_date": "2021-12-31"
    }
]
df = pd.DataFrame(fake)
EDIT:
Here is a faked example of what the program's output should look like. Note that most of the dates are chosen randomly - but the end date of the previous row matches the start date of the next row.
AcctID | account_status | start_date | end_date |
---|---|---|---|
C429768513 | ACTIVE | 2015-12-27 | 2016-01-05 |
C429768513 | INACTIVE | 2016-01-05 | 2016-03-01 |
C429768513 | ACTIVE | 2016-03-01 | 2017-06-22 |
C429768513 | INACTIVE | 2017-06-22 | 2017-09-04 |
C429768513 | ACTIVE | 2017-09-04 | 2018-10-27 |
C429768513 | INACTIVE | 2018-10-27 | 2019-04-04 |
C429768513 | ACTIVE | 2019-04-04 | 2020-06-06 |
C429768513 | INACTIVE | 2020-06-06 | 2021-12-31 |
One way to solve this:
import numpy as np

df = df.replace(to_replace='None', value=np.nan)

def random_date(x):
    # the first non-null start_date and the last non-null end_date bound the account's history
    s_d = pd.to_datetime(x[x['start_date'].notna()]['start_date'])
    e_d = pd.to_datetime(x[x['end_date'].notna()]['end_date'])
    start_u = s_d.iloc[0].value // 10**9
    end_u = e_d.iloc[0].value // 10**9
    # draw len(x)-1 random Unix timestamps within that range, sort them,
    # then append the known final end_date
    end_date_list = sorted(pd.to_datetime(np.random.randint(start_u, end_u, len(x) - 1), unit='s').values)
    end_date_list = np.append(end_date_list, e_d.values)
    x['end_date'] = end_date_list
    # every missing start_date is the previous row's end_date
    mask = x['start_date'].isna()
    x.loc[mask, 'start_date'] = x.shift(1).loc[mask]['end_date'].astype(str)
    x['start_date'] = pd.to_datetime(x['start_date']).dt.date
    x['end_date'] = pd.to_datetime(x['end_date']).dt.date
    return x

df = df.groupby('AcctID').apply(random_date)
Output:
AcctID account_status start_date end_date
0 C382861922 ACTIVE 2016-05-25 2016-12-23
1 C382861922 INACTIVE 2016-12-23 2017-12-28
2 C382861922 ACTIVE 2017-12-28 2019-04-24
3 C382861922 INACTIVE 2019-04-24 2021-12-31
4 C429768513 ACTIVE 2015-12-27 2017-12-04
5 C429768513 INACTIVE 2017-12-04 2019-01-07
6 C429768513 ACTIVE 2019-01-07 2019-04-03
7 C429768513 INACTIVE 2019-04-03 2020-06-13
8 C429768513 ACTIVE 2020-06-13 2021-02-13
9 C429768513 INACTIVE 2021-02-13 2021-03-09
10 C429768513 ACTIVE 2021-03-09 2021-08-09
11 C429768513 INACTIVE 2021-08-09 2021-12-31
12 C643625629 ACTIVE 2016-07-24 2021-02-27
13 C643625629 INACTIVE 2021-02-27 2021-05-20
14 C643625629 ACTIVE 2021-05-20 2021-12-31
15 C82157435 ACTIVE 2016-10-22 2021-02-20
16 C82157435 INACTIVE 2021-02-20 2021-12-31
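The key design choice is that, within each account, the function draws len(x)-1 random Unix timestamps between the first known start_date and the last known end_date and sorts them, so the status-change boundaries come out in chronological order and can be chained from one row to the next. If you want the fake data to be reproducible, one option (my suggestion, not part of the answer above) is to seed NumPy before applying the function:

import numpy as np

np.random.seed(42)  # any fixed seed makes the random dates repeatable across runs
df = df.groupby('AcctID').apply(random_date)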