Generate binary outcome dummy data based on the probability of items and a feature

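I want to generate synthetic data from scratch: sequence data with a binary outcome (0/1). My data has the following properties -

As an example, suppose there are only 3 items in the sequences, namely A, B and C, so the data is -

Given a conversion probability for each item when it appears in the data (for example, whenever A is encountered in a sequence the probability of outcome 1 is about 2%, when B appears it is 2.6%, and so on; this is just an example), I want to generate the data randomly. The generated data should look like this -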

ID  Sequence  Feature  Outcome
1   A->B      X        0
2   C->C->B   Y        1
3   A->B      X        1
4   A         Z        0
5   A->B->A   Z        0
6   C->C      Y        1

and so on.

When generating this data, I want to control -

Is there a simple way to generate this data while keeping all of these parameters in mind?

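While this may not be the most elegant approach, you can do it with a for loop. For each row, use .split() to split that row's Sequence into a list of events. You can use .count() to count the occurrences of each element, len() to get the length, and np.sum() or np.mean() to get the total or average outcome. Try this code as a starting point: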

# Assumes df is a pandas DataFrame with a 'Sequence' column such as 'A->B->A'
df['Outcome'] = 0.0

for i, row in df.iterrows():
    list_of_events = row['Sequence'].split('->')
    # do your calculations on list_of_events here
    print(len(list_of_events))
    print(list_of_events.count("A"))
    my_calculation_for_outcome = list_of_events.count("B")*0.02
    # .loc is indexed with square brackets, not called like a function
    df.loc[i, 'Outcome'] = my_calculation_for_outcome
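The starting point above stores a score rather than a 0/1 outcome. One possible way to turn per-item probabilities into a binary outcome is sketched below. It assumes that each item in a sequence converts independently, so the probability of outcome 1 is 1 minus the product of (1 - p) over the items; that independence rule is an assumption, not something stated in the question. The rates for A and B come from the question's example, the rate for C is made up, and the names ITEM_PROB, sequence_probability and draw_outcome are hypothetical.

```python
import random

# Per-item conversion probabilities. A and B follow the question's example;
# the value for C is a made-up placeholder.
ITEM_PROB = {"A": 0.02, "B": 0.026, "C": 0.03}


def sequence_probability(sequence, item_prob=ITEM_PROB, sep="->"):
    """P(outcome = 1), assuming every item converts independently."""
    p_none = 1.0
    for item in sequence.split(sep):
        p_none *= 1.0 - item_prob[item]
    return 1.0 - p_none


def draw_outcome(sequence, rng):
    """Draw a single 0/1 outcome for one sequence."""
    return int(rng.random() < sequence_probability(sequence))


rng = random.Random(0)  # seeded so the run is reproducible
sequences = ["A->B", "C->C->B", "A->B", "A", "A->B->A", "C->C"]
outcomes = [draw_outcome(s, rng) for s in sequences]
```

With rates this low most rows will be 0, so a large number of rows is needed before the empirical rates become visible.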

You may want to look here to make sure the Outcome column has a given number of true values: A fast way to find the largest N elements in an numpy array
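As a rough illustration of that idea, the sketch below marks exactly N rows as 1 by picking the N largest scores with np.argpartition. The helper name top_n_outcomes and the example scores are hypothetical. Note that this makes the outcome deterministic given the scores; adding random noise to the scores first would be one way to keep some randomness.

```python
import numpy as np


def top_n_outcomes(scores, n_positive):
    """Return a 0/1 array with exactly n_positive ones at the largest scores."""
    scores = np.asarray(scores, dtype=float)
    outcome = np.zeros(scores.shape[0], dtype=int)
    if n_positive > 0:
        # argpartition places the n_positive largest values in the last slots
        # without fully sorting the array.
        top_idx = np.argpartition(scores, -n_positive)[-n_positive:]
        outcome[top_idx] = 1
    return outcome


# Hypothetical per-row scores (for example, per-sequence probabilities)
scores = np.array([0.046, 0.08, 0.046, 0.02, 0.065, 0.06])
outcome = top_n_outcomes(scores, n_positive=2)
```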

import pandas as pd
import itertools
import numpy as np
import random


alphabets = ['A', 'B', 'C']

# All sequences of length 1..len(alphabets), e.g. 'A', 'A->B', 'C->C->B'
combinations = []
for n in range(1, len(alphabets) + 1):
    combinations.extend('->'.join(p) for p in itertools.product(alphabets, repeat=n))

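# Random sampling weight for each sequence, normalised to sum to 1.
# A normal draw can in principle be negative, which random.choices does not
# accept; with mean 100 and sd 30 this is very unlikely.
weights = np.random.normal(100, 30, len(combinations))
weights /= weights.sum()
weights = weights.tolist()

# Alternative weightings that were tried:
# weights = np.random.dirichlet(np.ones(len(combinations))*1000., size=1)
# n = len(combinations)
# weights = [random.random() for _ in range(n)]
# sum_weights = sum(weights)
# weights = [w/sum_weights for w in weights]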


df=pd.DataFrame(random.choices(
    population=combinations,weights=weights,
    k=1000000),columns=['sequence'])

# -
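For comparison, here is a minimal sketch of the same sampling step using NumPy's Generator API with Dirichlet weights, which are non-negative and sum to 1 by construction. The concentration value 10.0, the seed, and the sample size are arbitrary choices made for illustration.

```python
import itertools

import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility

items = ["A", "B", "C"]
max_len = 3

# Every sequence of length 1..max_len over the items
sequences = [
    "->".join(combo)
    for n in range(1, max_len + 1)
    for combo in itertools.product(items, repeat=n)
]

# Dirichlet weights are a valid probability vector; a larger concentration
# gives more uniform weights.
weights = rng.dirichlet(np.ones(len(sequences)) * 10.0)

# Draw sequences with replacement according to the weights
sample = rng.choice(sequences, size=10_000, p=weights)
```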

import matplotlib.pyplot as plt

# Histogram of the sampling weights
plt.hist(weights, bins=20)
plt.show()

# Number of rows drawn for each distinct sequence
distribution = df.groupby('sequence').size().reset_index(name='Total_Numbers')
plt.hist(distribution.Total_Numbers)
plt.show()

# + tags=[]
from tqdm import tqdm

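# Base conversion rate for each item
A=0.2
B=0.8
C=0.1
# Per-item counters: rows equal to X->X->X, rows containing X->X, rows containing X
count_AAA=count_AA=count_A=0
count_BBB=count_BB=count_B=0
count_CCC=count_CC=count_C=0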

for i in tqdm(range(0,len(df))):
    if(df.sequence[i]=='A->A->A'):
        count_AAA+=1
    if('A->A' in df.sequence[i]):
        count_AA+=1
    if('A' in df.sequence[i]):
        count_A+=1
    if(df.sequence[i]=='B->B->B'):
        count_BBB+=1
    if('B->B' in df.sequence[i]):
        count_BB+=1
    if('B' in df.sequence[i]):
        count_B+=1
    if(df.sequence[i]=='C->C->C'):
        count_CCC+=1
    if('C->C' in df.sequence[i]):
        count_CC+=1
    if('C' in df.sequence[i]):
        count_C+=1
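# One Bernoulli draw per matching row; the rate is the item's base rate
# scaled by a pattern-specific multiplier.
bi_AAA = np.random.binomial(1, A*0.9, count_AAA)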
bi_AA = np.random.binomial(1, A*0.5, count_AA)
bi_A = np.random.binomial(1, A*0.1, count_A)

bi_BBB = np.random.binomial(1, B*0.9, count_BBB)
bi_BB = np.random.binomial(1, B*0.5, count_BB)
bi_B = np.random.binomial(1, B*0.1, count_B)

bi_CCC = np.random.binomial(1, C*0.9, count_CCC)
bi_CC = np.random.binomial(1, C*0.5, count_CC)
bi_C = np.random.binomial(1, C*0.15, count_C)
# -
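The counting loop above can also be expressed with vectorized pandas string methods, which may be noticeably faster on a million rows. The sketch below keeps the same matching rules (exact equality for X->X->X, substring match for X->X and for X); the helper pattern_counts and the small demo frame are hypothetical.

```python
import pandas as pd

# Small hypothetical frame standing in for the sampled sequences
demo = pd.DataFrame(
    {"sequence": ["A->A->A", "A->A", "A->B", "B->B->B", "C", "B->A->A"]}
)


def pattern_counts(sequences, item):
    """Count rows matching the three patterns used for one item."""
    triple = "->".join([item] * 3)
    double = "->".join([item] * 2)
    return {
        # rows that are exactly X->X->X
        "triple": int(sequences.eq(triple).sum()),
        # rows that contain X->X anywhere (plain substring, no regex)
        "double": int(sequences.str.contains(double, regex=False).sum()),
        # rows that contain X anywhere
        "single": int(sequences.str.contains(item, regex=False).sum()),
    }


counts = {item: pattern_counts(demo["sequence"], item) for item in "ABC"}
```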

# Empirical conversion rate for B->B->B rows; should be close to B*0.9 = 0.72
bi_BBB.sum()/count_BBB
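A quick sanity check of this kind can be made slightly more systematic by comparing the empirical rate against the target rate using the binomial standard error. The sketch below is only an illustration: the sample size, the seed, and the 5-standard-error tolerance are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

p = 0.8 * 0.9   # target rate, as for the B->B->B draws
n = 50_000      # hypothetical number of matching rows

draws = rng.binomial(1, p, n)
rate = draws.mean()

# Standard error of the mean of n Bernoulli(p) draws
stderr = np.sqrt(p * (1 - p) / n)

# True when the empirical rate is within 5 standard errors of the target
within_tolerance = abs(rate - p) < 5 * stderr
```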

# + tags=[]
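# Running indices into the bi_* arrays. Note: this overwrites the base rates
# A, B and C defined earlier, which is safe only because the draws above
# have already been made.
AAA=AA=A=BBB=BB=B=CCC=CC=C=0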

for i in tqdm(range(0,len(df))):
    if(df.sequence[i]=='A->A->A'):
        df.at[i, 'Outcome_AAA'] = bi_AAA[AAA]
        AAA+=1
    if('A->A' in df.sequence[i]):
        df.at[i, 'Outcome_AA'] = bi_AA[AA]
        AA+=1
    if('A' in df.sequence[i]):
        df.at[i, 'Outcome_A'] = bi_A[A]
        A+=1
    if(df.sequence[i]=='B->B->B'):
        df.at[i, 'Outcome_BBB'] = bi_BBB[BBB]
        BBB+=1
    if('B->B' in df.sequence[i]):
        df.at[i, 'Outcome_BB'] = bi_BB[BB]
        BB+=1
    if('B' in df.sequence[i]):
        df.at[i, 'Outcome_B'] = bi_B[B]
        B+=1
    if(df.sequence[i]=='C->C->C'):
        df.at[i, 'Outcome_CCC'] = bi_CCC[CCC]
        CCC+=1
    if('C->C' in df.sequence[i]):
        df.at[i, 'Outcome_CC'] = bi_CC[CC]
        CC+=1
    if('C' in df.sequence[i]):
        df.at[i, 'Outcome_C'] = bi_C[C]
        C+=1
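
# Rows that never matched a pattern have NaN in that column; treat them as 0
df = df.fillna(0)

# The final outcome is 1 if any of the pattern-level draws is 1
outcome_cols = [c for c in df.columns if c.startswith('Outcome_')]
df['Outcome'] = (df[outcome_cols].sum(axis=1) > 0).astype(int)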
dataset=df[['sequence','Outcome']]
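Because the final Outcome is 1 whenever any of the independent pattern-level draws is 1, the same distribution can be obtained with a single draw per row using probability 1 minus the product of (1 - p) over the patterns that match. The sketch below uses the base rates and multipliers from the code above (including the 0.15 multiplier for C); the names BASE, MULT and outcome_probability are hypothetical.

```python
import numpy as np

# Base rates and (triple, double, single) multipliers taken from the code above
BASE = {"A": 0.2, "B": 0.8, "C": 0.1}
MULT = {"A": (0.9, 0.5, 0.1), "B": (0.9, 0.5, 0.1), "C": (0.9, 0.5, 0.15)}


def outcome_probability(seq):
    """P(Outcome = 1) when each matching pattern fires independently."""
    p_none = 1.0
    for item, base in BASE.items():
        triple, double, single = MULT[item]
        if seq == "->".join([item] * 3):
            p_none *= 1 - base * triple
        if "->".join([item] * 2) in seq:
            p_none *= 1 - base * double
        if item in seq:
            p_none *= 1 - base * single
    return 1 - p_none


rng = np.random.default_rng(0)
seqs = np.array(["A", "A->B", "B->B->B", "C->C"])
probs = np.array([outcome_probability(s) for s in seqs])

# One uniform draw per row replaces the nine per-pattern outcome columns
outcomes = (rng.random(len(seqs)) < probs).astype(int)
```

For a large frame, the probability only needs to be computed once per distinct sequence and can then be mapped onto the rows.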