Generate lists from dataframe with even representation between multiple categorical variables

Question

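I am trying to define groups from a DataFrame. The groups must be as similar as possible with respect to the categorical variables.

For example, I have 10 marbles and need to make 3 groups. 4 of my marbles are blue, 2 are yellow, and 4 are white.

10 marbles will not divide evenly into 3 groups, so the group sizes will be 4, 3, 3, i.e. as close to even as possible.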
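The size arithmetic can be sketched as below. The helper name `group_sizes` is hypothetical and not part of the original post; it only illustrates how sizes that differ by at most one can be derived with `divmod`.

```python
def group_sizes(n_items, n_groups):
    """Split n_items into n_groups sizes that differ by at most one."""
    base, extra = divmod(n_items, n_groups)
    # The first `extra` groups take one additional item each.
    return [base + 1 if k < extra else base for k in range(n_groups)]

print(group_sizes(10, 3))  # [4, 3, 3]
```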

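Likewise, since we only have 2 yellow marbles, the colors will not have even representation between groups. However, those yellow marbles must be distributed across the groups as evenly as possible. This continues across all categorical variables in the dataset.

My original plan was to just check whether a row's values already exist in a group and, if so, try another group. My coworker pointed out a better way to generate the groups: score them with one-hot encoding, then swap rows until the one-hot sums approach similar levels (indicating that each group contains a "close to representative" variety of the categorical variables). His solution is the posted answer.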
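To make the scoring idea concrete, here is a minimal sketch using the marble example. The `marbles` frame, its group assignment, and the `imbalance` helper are all made up for illustration. The score for a pair of groups is the sum of absolute differences between their one-hot column totals, so lower means more alike.

```python
import pandas as pd

marbles = pd.DataFrame({
    "color": ["blue"] * 4 + ["yellow"] * 2 + ["white"] * 4,
    "group": [0, 1, 2, 0, 0, 1, 0, 1, 2, 2],  # hypothetical assignment
})

# One indicator column per category value.
dummies = pd.get_dummies(marbles["color"]).astype(int)
# Per-group totals of each indicator: rows are groups, columns are colors.
totals = dummies.groupby(marbles["group"]).sum()

def imbalance(totals, g1, g2):
    """Sum of absolute differences in category counts between two groups."""
    return int((totals.loc[g1] - totals.loc[g2]).abs().sum())

print(totals)
print(imbalance(totals, 0, 1), imbalance(totals, 0, 2), imbalance(totals, 1, 2))
```

Swapping two rows between groups and keeping the swap only when this score drops is the hill-climbing step the answer relies on.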

import pandas as pd
import numpy as np

test = pd.DataFrame({'A': ['alice', 'bob', 'george', 'michael', 'john', 'peter', 'paul', 'mary'],
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C': ['dog', 'cat', 'dog', 'cat', 'dog', 'cat', 'dog', 'cat'],
                     'D': ['boy', 'girl', 'boy', 'girl', 'boy', 'girl', 'boy', 'girl']})

gr1, gr2, gr3 = [], [], []
gr1_names = []

def test_check1(x):
    # this is where I'm clearly not approaching this problem correctly
    for index, row in x.iterrows():
        if row['A'] not in gr1 and row['B'] not in gr1 and row['C'] not in gr1 and row['D'] not in gr1:
            gr1.extend(row)  # keep a record of what values are in what groups
            gr1_names.append(row['A'])  # save the name

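But going this route I would also need to be able to say something like "well if the row wasn't allowed into ANY groups just toss it into the first one. Then, the next time the row wasn't allowed into ANY groups just toss it in the second one", and so on.

I can see that my example code does not adequately handle that situation.

I tried a random number generator and then making bins, and honestly that came pretty close, but I am hoping to find a non-random answer.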
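One non-random possibility, offered only as a sketch and not as the method from the posted answer, is to sort the rows by the categorical columns and deal them out round-robin. The function name `deal_round_robin` is hypothetical, and the sketch assumes the frame has a unique index. Each distinct combination of values ends up split across groups within one row of even, but the individual columns further down the sort order are only balanced approximately.

```python
import pandas as pd

def deal_round_robin(df, cat_cols, n_groups):
    """Sort rows by the categorical columns, then deal them out in turn."""
    ordered = df.sort_values(cat_cols)
    # Row at sorted position p goes to group p % n_groups.
    groups = pd.Series([p % n_groups for p in range(len(ordered))],
                       index=ordered.index)
    out = df.copy()
    out["group"] = groups  # aligned back to the original rows by index
    return out

marbles = pd.DataFrame(
    {"color": ["blue"] * 4 + ["yellow"] * 2 + ["white"] * 4})
dealt = deal_round_robin(marbles, ["color"], 3)
print(pd.crosstab(dealt["group"], dealt["color"]))
```

With a single categorical column, as here, every color's counts differ by at most one between groups. With several columns, the result could serve as a deterministic starting point for a swap-based refinement.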

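Here are some links I thought would help me with my work today: How to get all possible combinations of a list’s elements?

Get unique combinations of elements from a python list

---This one feels very close, but I could not figure out how to manipulate it into what I need---

The expected output would be a dataframe of any shape, but a pivot of said dataframe would indicate: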

group id    foo bar faz
       1    3   2   5
       2    3   2   5
       3    3   1   5
       4    4   1   5
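A pivot like the one above can be produced from any assignment with `pd.crosstab`. The `assigned` frame and its `kind` column below are made-up data used only to show the check. The per-category spread (max minus min across groups) gives a quick read on how even the split is.

```python
import pandas as pd

assigned = pd.DataFrame({
    "group": [1, 1, 2, 2, 3, 3],
    "kind": ["foo", "bar", "foo", "bar", "foo", "faz"],  # made-up data
})

# Rows are group ids, columns are category values, cells are counts.
pivot = pd.crosstab(assigned["group"], assigned["kind"])
# Difference between the fullest and emptiest group for each category.
spread = pivot.max() - pivot.min()

print(pivot)
print(spread)
```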

Answer 1

My coworker found a solution, and I think the solution also explains the problem better.

import pandas as pd
import random
import math
import itertools

def n_per_group(n, n_groups):
    """find the size of each group when splitting n people into n_groups"""
    # base size for every group, plus how many groups need one extra person
    base, rem = divmod(n, n_groups)
    return [base + 1 if k < rem else base for k in range(n_groups)]

def assign_groups(n, n_groups):
    """split the n people in n_groups pretty evenly, and randomize"""
    n_per = n_per_group(n ,n_groups)
    groups = list(itertools.chain(*[i[0]*[i[1]] for i in zip(n_per,list(range(n_groups)))]))
    random.shuffle(groups)
    return groups

def group_diff(df, g1, g2):
    """calculate the between group score difference"""
    a = df.loc[df['group']==g1, ~df.columns.isin(('A','group'))].sum()
    b = df.loc[df['group']==g2, ~df.columns.isin(('A','group'))].sum()
    #print(a)
    return abs(a-b).sum()

def swap_groups(df, row1, row2):
    """swap the groups of the people in row1 and row2"""
    r1group = df.loc[row1,'group']
    r2group = df.loc[row2,'group']
    df.loc[row2,'group'] = r1group
    df.loc[row1,'group'] = r2group
    return df

def row_to_group(df, row):
    """get the group associated to a given row"""
    return df.loc[row,'group']

def swap_and_score(df, row1, row2):
    """
    given two rows, calculate the between group scores
    originally, and if we swap rows. If the score difference
    is minimized by swapping, return the swapped df, otherwise
    return the original (swap back)
    """
    #orig = df
    g1 = row_to_group(df,row1)
    g2 = row_to_group(df,row2)
    s1 = group_diff(df,g1,g2)
    df = swap_groups(df, row1, row2)
    s2 = group_diff(df,g1,g2)
    #print(s1,s2)
    if s1>s2:
        #print('swap')
        return df
    else:
        return swap_groups(df, row1, row2)

def pairwise_scores(df):
    d = []
    for i in range(n_groups):
        for j in range(i+1,n_groups):
            d.append(group_diff(df,i,j))
    return d

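# df, n and n_groups were not defined in the original snippet; the values
# below are assumptions: df is taken to be the `test` frame from the
# question (default RangeIndex) and 3 groups are requested
df = test.copy()
n = len(df)
n_groups = 3

# one hot encode and copy
df_dum = pd.get_dummies(df, columns=['B', 'C', 'D']).copy(deep=True)

#drop extra cols as needed

groups = assign_groups(n, n_groups)
df_dum['group'] = groups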

# iterate
for _ in range(5000):
    rows = random.choices(list(range(n)),k=2)
    #print(rows)
    df_dum = swap_and_score(df_dum,rows[0],rows[1])
    #print(pairwise_scores(df))

print(pairwise_scores(df_dum))

df['group'] = df_dum.group
df['orig_groups'] = groups

# show the per-category count differences for every pair of groups
for i in range(n_groups):
    for j in range(i+1,n_groups):
        a = df_dum.loc[df_dum['group']==i, ~df_dum.columns.isin(('A','group'))].sum()
        b = df_dum.loc[df_dum['group']==j, ~df_dum.columns.isin(('A','group'))].sum()
        print(a-b)

I will change the question itself to better explain what is needed, since I don't think I explained the end goal very well the first time.

python

list

unique

itertools

pandas