For each unique value in a pandas DataFrame column, how can I randomly select a proportion of rows?

New to Python here. Imagine a csv file that looks something like this:

(...except that in real life, the Person column has 20 different names, with 300-500 rows per Person. And there are multiple data columns, not just one.)

What I want to do is randomly select 10% of each Person's rows and mark them in a new column. I came up with a wildly convoluted way to do it -- it involved creating a helper column of random numbers and all kinds of unnecessarily complicated games. It worked, but it was crazy. More recently, I came up with this:

import pandas as pd

df = pd.read_csv('source.csv')
df['selected'] = ''

names = list(df['Person'].unique())  # gets list of unique names

for name in names:
    df_temp = df[df['Person'] == name]
    samp = int(len(df_temp)/10)   # I want to sample 10% for each name
    df_temp = df_temp.sample(samp)
    df_temp['selected'] = 'bingo!'   # a new column to mark the rows I've randomly selected
    df = df.merge(df_temp, how='left', on=['Person', 'data'])
    df['temp'] = [f"{a} {b}" for a, b in zip(df['selected_x'], df['selected_y'])]
        # Note: initially instead of the line above, I tried the line below, but it didn't work too well:
        # df['temp'] = df['selected_x'] + df['selected_y']
    df = df[['Person', 'data', 'temp']]
    df = df.rename(columns={'temp': 'selected'})

df['selected'] = df['selected'].str.replace('nan', '').str.strip()  # cleans up the column

As you can see, basically I pull out a temporary DataFrame for each Person, use DF.sample(number) to randomly select rows, and then use DF.merge to get the 'marked' rows back into the original DataFrame. And it involves iterating over a list to create each temporary DataFrame... my understanding is that iterating is kind of lame.
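For what it's worth, the merge-back step by itself can be made less painful: merge accepts an indicator=True flag that marks matched rows directly, which avoids the selected_x / selected_y string games. A minimal sketch with hypothetical toy data (not the real csv):

```python
import pandas as pd

# Hypothetical stand-in for the real csv.
df = pd.DataFrame({'Person': ['a', 'a', 'b', 'b'], 'data': [1, 2, 3, 4]})

# Sample 2 rows, then merge back with indicator=True: the '_merge'
# column is 'both' exactly for rows that appear in the sample.
sample = df.sample(2, random_state=0)
df = df.merge(sample, how='left', on=['Person', 'data'], indicator=True)
df['selected'] = (df['_merge'] == 'both').map({True: 'bingo!', False: ''})
df = df.drop(columns='_merge')
```

Like the original merge, this still assumes (Person, data) combinations are unique, so it's a cleanup of the marking step rather than a fix for the iteration itself.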

There must be a more Pythonic, vectorized way to do this, right? Without iterating. Maybe involving groupby? Any thoughts or suggestions are much appreciated.

EDIT: Here's another way that avoids merge... but it's still pretty clunky:

import pandas as pd
import numpy as np
import math

# SETUP TEST DATA:
y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data': z})
df = df.sample(frac=1)  # shuffle (optional--just to show order doesn't matter)
percent = 10  # CHANGE AS NEEDED

# Add a 'helper' column with random numbers
df['rand'] = np.random.random(df.shape[0])

# CREATE A HELPER LIST: one [name, count] entry per person
helper = pd.DataFrame(df.groupby('persons')['rand'].count()).reset_index().values.tolist()
for row in helper:
    df_temp = df[df['persons'] == row[0]][['persons', 'rand']]
    lim = math.ceil(len(df_temp) * percent * 0.01)
    row.append(df_temp.nlargest(lim, 'rand').iloc[-1][1])  # per-person cutoff value

def flag(name, num):
    for row in helper:
        if row[0] == name:
            if num >= row[2]:
                return 'yes'
            else:
                return 'no'

df['flag'] = df.apply(lambda x: flag(x['persons'], x['rand']), axis=1)

If I understood you correctly, you can use:

df = pd.DataFrame(data={'persons': ['A']*10 + ['B']*10, 'col_1': [2]*20})
percentage_to_flag = 0.5
a = df.groupby(['persons'])['col_1'].apply(
        lambda x: pd.Series(x.index.isin(
            x.sample(frac=percentage_to_flag, random_state=5, replace=False).index))
    ).reset_index(drop=True)
df['flagged'] = a

Input:

       persons  col_1
    0        A      2
    1        A      2
    2        A      2
    3        A      2
    4        A      2
    5        A      2
    6        A      2
    7        A      2
    8        A      2
    9        A      2
    10       B      2
    11       B      2
    12       B      2
    13       B      2
    14       B      2
    15       B      2
    16       B      2
    17       B      2
    18       B      2
    19       B      2

Output with 50% flagged rows in each group:

     persons  col_1  flagged
0        A      2    False
1        A      2    False
2        A      2     True
3        A      2    False
4        A      2     True
5        A      2     True
6        A      2    False
7        A      2     True
8        A      2    False
9        A      2     True
10       B      2    False
11       B      2    False
12       B      2     True
13       B      2    False
14       B      2     True
15       B      2     True
16       B      2    False
17       B      2     True
18       B      2    False
19       B      2     True
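One caveat worth noting: the reset_index(drop=True) assignment above relies on df having a default 0..n-1 index already sorted by group. A transform-based variant (a sketch on the same toy data, not part of the original answer) aligns the result by index instead, so it works regardless of row order:

```python
import pandas as pd

df = pd.DataFrame(data={'persons': ['A']*10 + ['B']*10, 'col_1': [2]*20})
percentage_to_flag = 0.5

# transform returns a result aligned to df's own index,
# so no reset_index juggling is needed.
df['flagged'] = df.groupby('persons')['col_1'].transform(
    lambda s: s.index.isin(
        s.sample(frac=percentage_to_flag, random_state=5).index))
```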

You can use groupby.sample, either to pick out a sample of the whole dataframe for further processing, or to identify the rows of the dataframe to mark, if that's more convenient.

import pandas as pd

percentage_to_flag = 0.5

# Toy data: 8 rows, persons A and B.
df = pd.DataFrame(data={'persons':['A']*4 + ['B']*4, 'data':range(8)})
#   persons  data
# 0       A     0
# 1       A     1
# 2       A     2
# 3       A     3
# 4       B     4
# 5       B     5
# 6       B     6
# 7       B     7

# Pick out random sample of dataframe.
random_state = 41  # Change to get different random values.
df_sample = df.groupby("persons").sample(frac=percentage_to_flag,
                                         random_state=random_state)
#   persons  data
# 1       A     1
# 2       A     2
# 7       B     7
# 6       B     6

# Mark the random sample in the original dataframe.
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
#   persons  data  marked
# 0       A     0   False
# 1       A     1    True
# 2       A     2    True
# 3       A     3   False
# 4       B     4   False
# 5       B     5   False
# 6       B     6    True
# 7       B     7    True

If you don't actually need the subsampled dataframe df_sample, you can skip straight to marking a sample of the original dataframe:

# Mark random sample in original dataframe with minimal intermediate data.
df["marked2"] = False
df.loc[df.groupby("persons")["data"].sample(frac=percentage_to_flag,
                                            random_state=random_state).index,
       "marked2"] = True
#   persons  data  marked  marked2
# 0       A     0   False    False
# 1       A     1    True     True
# 2       A     2    True     True
# 3       A     3   False    False
# 4       B     4   False    False
# 5       B     5   False    False
# 6       B     6    True     True
# 7       B     7    True     True

This is TMBailey's answer, tweaked so it works with my version of Python. (I didn't want to edit someone else's answer, but if I'm doing this wrong I'll take this down.) This is really terrific and fast!

EDIT: I updated this based on a further suggestion from TMBailey, replacing frac=percentage_to_flag with n=math.ceil(percentage_to_flag * len(x)). This ensures that rounding doesn't pull the sampled %age below the 'percentage_to_flag' threshold. (For what it's worth, you can also replace it with frac=(math.ceil(percentage_to_flag * len(x)))/len(x).)
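The rounding difference is easy to see in isolation, using Doug's 34123-row count from the toy data below (current pandas versions compute the frac sample size as round(frac * len), which is what makes the undershoot possible):

```python
import math

n_rows = 34123  # Doug's row count from the toy data
print(round(0.10 * n_rows))      # 3412 -> just under 10%
print(math.ceil(0.10 * n_rows))  # 3413 -> at least 10%
```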

import pandas as pd
import math

percentage_to_flag = .10

# Toy data:
y = ['Alex'] * 2321 + ['Eddie'] * 876 + ['Doug'] * 34123  + ['Chuck'] * 2012 + ['Bob'] * 9281 
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data' : z})
df = df.sample(frac = 1) #optional shuffle, just to show order doesn't matter

# Pick out random sample of dataframe.
random_state = 41  # Change to get different random values.
df_sample = df.groupby("persons").apply(lambda x: x.sample(n=(math.ceil(percentage_to_flag * len(x))),random_state=random_state))
#had to use lambda in line above
df_sample = df_sample.reset_index(level=0, drop=True)  #had to add this to simplify multi-index DF

# Mark the random sample in the original dataframe.
df["marked"] = False
df.loc[df_sample.index, "marked"] = True

Then to check it:

pp = df.pivot_table(index="persons", columns="marked", values="data",
                    aggfunc='count', fill_value=0)
pp.columns = ['no', 'yes']
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
pp = pd.concat([pp, pp.sum().rename('Total').to_frame().T]).assign(Total=lambda d: d.sum(1))
pp['% selected'] = 100 * pp.yes / pp.Total
print(pp)

OUTPUT:
            no   yes  Total  % selected
persons
Alex      2088   233   2321   10.038776
Bob       8352   929   9281   10.009697
Chuck     1810   202   2012   10.039761
Doug     30710  3413  34123   10.002051
Eddie      788    88    876   10.045662
Total    43748  4865  48613   10.007611

Works like a charm.