For each unique value in a pandas DataFrame column, how can I randomly select a proportion of rows?
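Python newbie here.
Imagine a csv file that looks something like this:
(...except that in real life, there are 20 distinct names in the Person column, and each Person has 300-500 rows. Also, there are multiple data columns, not just one.)
What I want to do is randomly flag 10% of each Person's rows and mark this in a new column. I came up with a ridiculously convoluted way to do this: it involved creating a helper column of random numbers and all sorts of unnecessarily complicated games. It worked, but was crazy. More recently, I came up with this: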
import pandas as pd
df = pd.read_csv('source.csv')
df['selected'] = ''
names= list(df['Person'].unique()) #gets list of unique names
for name in names:
    df_temp = df[df['Person']== name]
    samp = int(len(df_temp)/10) # I want to sample 10% for each name
    df_temp = df_temp.sample(samp)
    df_temp['selected'] = 'bingo!' #a new column to mark the rows I've randomly selected
    df = df.merge(df_temp, how = 'left', on = ['Person','data'])
    df['temp'] =[f"{a} {b}" for a,b in zip(df['selected_x'],df['selected_y'])]
    #Note: initially instead of the line above, I tried the line below, but it didn't work too well:
    #df['temp'] = df['selected_x'] + df['selected_y']
    df = df[['Person','data','temp']]
    df = df.rename(columns = {'temp':'selected'})
df['selected'] = df['selected'].str.replace('nan','').str.strip() #cleans up the column
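As you can see, I essentially pull out a temp DataFrame for each Person, use DF.sample(number) to do the randomizing, then use DF.merge to get the 'marked' rows back into the original DataFrame. It involves iterating through a list to create each temp DataFrame... and my understanding is that iterating is kind of lame.
There's got to be a more Pythonic, vectorized way to do this, right? Without iterating. Maybe something involving groupby? Any thoughts or advice much appreciated.
EDIT: Here's another way that avoids merge... but it's still pretty clunky: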
import pandas as pd
import numpy as np #needed for np.random below
import math
#SETUP TEST DATA:
y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data' : z})
df = df.sample(frac = 1) #shuffle (optional--just to show order doesn't matter)
percent = 10 #CHANGE AS NEEDED
#Add a 'helper' column with random numbers
df['rand'] = np.random.random(df.shape[0])
df = df.sample(frac=1) #this shuffles data, just to show order doesn't matter
#CREATE A HELPER LIST: one [name, row count] entry per person
helper = pd.DataFrame(df.groupby('persons')['rand'].count()).reset_index().values.tolist()
for row in helper:
    df_temp = df[df['persons'] == row[0]][['persons','rand']]
    lim = math.ceil(len(df_temp) * percent*0.01)
    #append the cutoff: the smallest of this person's top 'lim' random numbers
    row.append(df_temp.nlargest(lim,'rand').iloc[-1]['rand'])
def flag(name,num):
    for row in helper:
        if row[0] == name:
            if num >= row[2]:
                return 'yes'
            else:
                return 'no'
df['flag'] = df.apply(lambda x: flag(x['persons'], x['rand']), axis=1)
If I understood you correctly, you could use:
df = pd.DataFrame(data={'persons':['A']*10 + ['B']*10, 'col_1':[2]*20})
percentage_to_flag = 0.5
a = df.groupby(['persons'])['col_1'].apply(lambda x: pd.Series(x.index.isin(x.sample(frac=percentage_to_flag, random_state= 5, replace=False).index))).reset_index(drop=True)
df['flagged'] = a
Input:
persons col_1
0 A 2
1 A 2
2 A 2
3 A 2
4 A 2
5 A 2
6 A 2
7 A 2
8 A 2
9 A 2
10 B 2
11 B 2
12 B 2
13 B 2
14 B 2
15 B 2
16 B 2
17 B 2
18 B 2
19 B 2
Output with 50% flagged rows in each group:
persons col_1 flagged
0 A 2 False
1 A 2 False
2 A 2 True
3 A 2 False
4 A 2 True
5 A 2 True
6 A 2 False
7 A 2 True
8 A 2 False
9 A 2 True
10 B 2 False
11 B 2 False
12 B 2 True
13 B 2 False
14 B 2 True
15 B 2 True
16 B 2 False
17 B 2 True
18 B 2 False
19 B 2 True
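One thing to watch with the snippet above: it lines the flags up by position after reset_index(drop=True), which assumes the rows are already grouped by persons and carry a default 0..n-1 index. Below is a minimal sketch of a variant that matches on index labels instead, so interleaved or shuffled rows should be fine. The interleaved toy data and the name sampled are only for illustration, not part of the answer above.

```python
import pandas as pd

# Toy data with the two groups interleaved (A, B, A, B, ...).
df = pd.DataFrame({"persons": ["A", "B"] * 10, "col_1": [2] * 20})
percentage_to_flag = 0.5

# Sample within each group; group_keys=False keeps the original row labels
# as the index of the result instead of adding a "persons" level.
sampled = (
    df.groupby("persons", group_keys=False)["col_1"]
      .apply(lambda s: s.sample(frac=percentage_to_flag, random_state=5))
)

# Flag a row when its label is among the sampled labels.
df["flagged"] = df.index.isin(sampled.index)

# Share of flagged rows per person.
print(df.groupby("persons")["flagged"].mean())
```

Checking the per-group mean of the boolean column is a quick way to confirm that each person got the intended share of flags.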
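You can use groupby.sample, either to pick out a sample of the whole dataframe for further processing, or to identify rows of the dataframe to mark, whichever is more convenient.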
import pandas as pd
percentage_to_flag = 0.5
# Toy data: 8 rows, persons A and B.
df = pd.DataFrame(data={'persons':['A']*4 + ['B']*4, 'data':range(8)})
# persons data
# 0 A 0
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 4
# 5 B 5
# 6 B 6
# 7 B 7
# Pick out random sample of dataframe.
random_state = 41 # Change to get different random values.
df_sample = df.groupby("persons").sample(frac=percentage_to_flag,
                                         random_state=random_state)
# persons data
# 1 A 1
# 2 A 2
# 7 B 7
# 6 B 6
# Mark the random sample in the original dataframe.
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
# persons data marked
# 0 A 0 False
# 1 A 1 True
# 2 A 2 True
# 3 A 3 False
# 4 B 4 False
# 5 B 5 False
# 6 B 6 True
# 7 B 7 True
If you really do not want the subsampled dataframe df_sample, you can mark a sample of the original dataframe directly:
# Mark random sample in original dataframe with minimal intermediate data.
df["marked2"] = False
df.loc[df.groupby("persons")["data"].sample(frac=percentage_to_flag,
                                            random_state=random_state).index,
       "marked2"] = True
# persons data marked marked2
# 0 A 0 False False
# 1 A 1 True True
# 2 A 2 True True
# 3 A 3 False False
# 4 B 4 False False
# 5 B 5 False False
# 6 B 6 True True
# 7 B 7 True True
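groupby.sample was added in pandas 1.1, so it may be missing on older installs. As a rough alternative under that assumption, the same marking can be sketched with a random helper column ranked within each group, which is essentially a loop-free version of the helper-column idea from the question. The names draw, rank_in_group, group_size and marked3 are made up for this sketch.

```python
import numpy as np
import pandas as pd

percentage_to_flag = 0.5

# Same toy data as above: 8 rows, persons A and B.
df = pd.DataFrame({"persons": ["A"] * 4 + ["B"] * 4, "data": range(8)})

# One random number per row; seeded so the result is repeatable.
rng = np.random.default_rng(41)
draw = pd.Series(rng.random(len(df)), index=df.index)

# Rank the random numbers within each person (1 = smallest draw in the group).
rank_in_group = draw.groupby(df["persons"]).rank(method="first")

# Number of rows in each row's group, aligned with the original index.
group_size = df.groupby("persons")["persons"].transform("size")

# Mark the rows whose rank falls within the first ceil(frac * size) of the group.
df["marked3"] = rank_in_group <= np.ceil(percentage_to_flag * group_size)

print(df)
```

Because the random numbers are independent of the data, taking the lowest-ranked rows in each group is a uniform random pick, and using ceil means a group never gets fewer rows than the requested fraction.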
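This is TMBailey's answer, tweaked so that it works in my Python version. (I didn't want to edit someone else's answer, but if I'm doing it wrong I'll take this down.) This works really well and really fast!
EDIT: I have updated this based on an additional suggestion by TMBailey to replace frac=percentage_to_flag with n=math.ceil(percentage_to_flag * len(x)). This ensures that rounding doesn't pull the sampled %age under the 'percentage_to_flag' threshold. (For what it's worth, you can replace it with frac=(math.ceil(percentage_to_flag * len(x)))/len(x) as well.)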
import pandas as pd
import math
percentage_to_flag = .10
# Toy data:
y = ['Alex'] * 2321 + ['Eddie'] * 876 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data' : z})
df = df.sample(frac = 1) #optional shuffle, just to show order doesn't matter
# Pick out random sample of dataframe.
random_state = 41 # Change to get different random values.
df_sample = df.groupby("persons").apply(lambda x: x.sample(n=(math.ceil(percentage_to_flag * len(x))),random_state=random_state))
#had to use lambda in line above
df_sample = df_sample.reset_index(level=0, drop=True) #had to add this to simplify multi-index DF
# Mark the random sample in the original dataframe.
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
Then to check:
pp = df.pivot_table(index="persons", columns="marked", values="data", aggfunc='count', fill_value=0)
pp.columns = ['no','yes']
pp.loc['Total'] = pp.sum() #add a totals row (DataFrame.append was removed in pandas 2.0)
pp = pp.assign(Total=lambda d: d.sum(axis=1))
pp['% selected'] = 100 * pp.yes/pp.Total
print(pp)
OUTPUT:
no yes Total % selected
persons
Alex 2088 233 2321 10.038776
Bob 8352 929 9281 10.009697
Chuck 1810 202 2012 10.039761
Doug 30710 3413 34123 10.002051
Eddie 788 88 876 10.045662
Total 43748 4865 48613 10.007611
Works like a charm.
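To see why the switch from frac to n=math.ceil(...) matters, here is a small arithmetic sketch using the group sizes from the toy data above. It assumes sample(frac=...) rounds frac * len(group) to the nearest whole row, which I believe is what pandas does, but treat that as an assumption. The dictionaries n_ceil and n_round are just names for this sketch.

```python
import math

percentage_to_flag = 0.10

# Group sizes from the toy data in this answer.
group_sizes = {"Alex": 2321, "Eddie": 876, "Doug": 34123, "Chuck": 2012, "Bob": 9281}

# Rows picked per group when rounding up versus rounding to nearest.
n_ceil = {name: math.ceil(percentage_to_flag * size) for name, size in group_sizes.items()}
n_round = {name: round(percentage_to_flag * size) for name, size in group_sizes.items()}

for name, size in group_sizes.items():
    print(f"{name:6s} rows={size:6d} "
          f"ceil={n_ceil[name]:5d} ({100 * n_ceil[name] / size:.3f}%) "
          f"round={n_round[name]:5d} ({100 * n_round[name] / size:.3f}%)")
```

For Alex, rounding to nearest gives 232 of 2321 rows, just under 10%, while ceil gives 233, which matches the 'yes' count in the table above.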