ValueError: Cannot take a larger sample than population when 'replace=False' using Groupby pandas

I want to randomly select 10 groups from a DataFrame, but I run into this error. How can I apply groupby before the random selection? I tried the following:

random_selection=tot_groups.groupby('query_col').apply(lambda x: x.sample(3))
random_selection=tot_groups.groupby('query_col').sample(n=10)

Error: ValueError: Cannot take a larger sample than population when 'replace=False'

Thanks!

Update:

Current dataset:

ABG23209.1,UBH04469.1,89.655,145,15,0,1,145,19,163,3.63e-100,275.0
ABG23209.1,UBH04470.1,89.655,145,15,0,1,145,20,164,4.68e-100,275.0
ABG23209.1,UBH04471.1,89.655,145,15,0,1,145,19,163,4.83e-100,275.0
ABG23209.1,UBH04472.1,89.655,145,15,0,1,145,24,168,5.58e-100,275.0
KOX89835.1,SFN69046.1,79.07,86,18,0,1,86,12,97,1.36e-49,143.0
KOX89835.1,SFE98714.1,77.907,86,19,0,1,86,19,104,2.1400000000000002e-49,143.0
KOX89835.1,WP_086938959.1,76.471,85,20,0,1,85,4,88,1.25e-48,140.0
KOX89835.1,WP_231794161.1,76.471,85,20,0,1,85,5,89,1.75e-48,140.0
KOX89835.1,WP_231794169.1,75.294,85,21,0,1,85,5,89,2.41e-48,140.0
WP_001287378.1,QBP98897.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_005164157.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_085071573.1,86.667,135,18,0,1,135,1,135,1.73e-85,241.0
WP_001287378.1,WP_014608965.1,86.765,136,17,1,1,135,1,136,2.49e-85,240.0
WP_001287378.1,WP_004932170.1,86.667,135,18,0,1,135,1,135,6.88e-78,221.0
WP_001287378.1,GGD19357.1,91.912,136,10,1,1,136,1,135,1.01e-77,221.0
WP_001287378.1,OMQ27200.1,85.926,135,19,0,1,135,1,135,1.79e-77,221.0
XP_037955766.1,WP_229689219.1,93.583,374,24,0,3,376,5,378,0.0,745.0
XP_037955766.1,WP_229799179.1,93.583,374,24,0,3,376,1,374,0.0,744.0
XP_037955766.1,WP_017454560.1,92.308,377,28,1,1,376,1,377,0.0,738.0
XP_037955766.1,WP_108127780.1,92.838,377,26,1,1,376,1,377,0.0,736.0

Desired output: randomly select n groups from the DataFrame, grouped by query_col. For example, with n=2:

WP_001287378.1,QBP98897.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_005164157.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_085071573.1,86.667,135,18,0,1,135,1,135,1.73e-85,241.0
WP_001287378.1,WP_014608965.1,86.765,136,17,1,1,135,1,136,2.49e-85,240.0
WP_001287378.1,WP_004932170.1,86.667,135,18,0,1,135,1,135,6.88e-78,221.0
WP_001287378.1,GGD19357.1,91.912,136,10,1,1,136,1,135,1.01e-77,221.0
WP_001287378.1,OMQ27200.1,85.926,135,19,0,1,135,1,135,1.79e-77,221.0
ABG23209.1,UBH04469.1,89.655,145,15,0,1,145,19,163,3.63e-100,275.0
ABG23209.1,UBH04470.1,89.655,145,15,0,1,145,20,164,4.68e-100,275.0
ABG23209.1,UBH04471.1,89.655,145,15,0,1,145,19,163,4.83e-100,275.0
ABG23209.1,UBH04472.1,89.655,145,15,0,1,145,24,168,5.58e-100,275.0
groupby's sample returns n elements from each group. If any group contains fewer than n elements, you get this error.
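To make the cause concrete, here is a minimal sketch on a made-up toy frame. The column `hit` and the helper `sample_up_to` are hypothetical names used only for illustration. It reproduces the ValueError and shows one possible way to cap the per-group sample size, in case sampling rows within each group is what you actually want:

```python
import pandas as pd

# Hypothetical toy data: group "a" has 3 rows, group "b" has only 1.
df = pd.DataFrame({
    "query_col": ["a", "a", "a", "b"],
    "hit": ["h1", "h2", "h3", "h4"],
})

# Asking for 2 rows per group fails because group "b" has a single row.
try:
    df.groupby("query_col").sample(n=2, random_state=0)
    raised = False
except ValueError as err:
    raised = True
    print(err)

def sample_up_to(frame, by, n, seed=0):
    """Sample at most n rows per group; groups smaller than n are kept whole."""
    parts = [
        group.sample(n=min(len(group), n), random_state=seed)
        for _, group in frame.groupby(by)
    ]
    return pd.concat(parts)

capped = sample_up_to(df, "query_col", n=2)
print(capped)
```

Note that this samples rows inside every group, which is different from picking whole groups at random.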

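To select whole groups at random, count how many groups there are, sample n numbers without replacement from the range [0, number of groups), and then return the rows whose group number is one of the sampled numbers.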

import random
import pandas as pd

random.seed(0)

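# The file has no header row, so name the first column "query_col".
tot_groups = pd.read_csv("data.csv",header=None).rename(columns={0:"query_col"})
grouped = tot_groups.groupby("query_col")  # suppose you want to use this

# Pick k distinct group numbers from 0 .. ngroups-1.
group_selectors = random.sample(range(grouped.ngroups), k=2)
# ngroup() labels every row with its group number; keep rows of the picked groups.
ret_df = tot_groups[grouped.ngroup().isin(group_selectors)]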

print(ret_df)

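However, there is no need to create a groupby object at all. You can collect the distinct query_col values, sample from them, and return the rows whose query_col is one of the sampled values: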

import random
import pandas as pd

random.seed(0)

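# The file has no header row, so name the first column "query_col".
tot_groups = pd.read_csv("data.csv",header=None).rename(columns={0:"query_col"})
# Distinct query values, one entry per group.
unique_queries = tot_groups["query_col"].unique().tolist()
selected_queries = random.sample(unique_queries,k=2)

# Keep every row that belongs to one of the sampled queries.
ret_df = tot_groups[tot_groups["query_col"].isin(selected_queries)]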

print(ret_df)
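The same idea can also be written with pandas alone, without the random module. This is only a sketch on made-up data (the frame contents and the `hit` column are hypothetical), using `drop_duplicates` followed by `Series.sample` to pick the groups:

```python
import pandas as pd

# Hypothetical toy data standing in for the table in the question.
tot_groups = pd.DataFrame({
    "query_col": ["q1", "q1", "q2", "q3", "q3", "q3"],
    "hit": ["h1", "h2", "h3", "h4", "h5", "h6"],
})

# Draw 2 distinct query_col values without replacement.
selected = tot_groups["query_col"].drop_duplicates().sample(n=2, random_state=0)

# Keep every row belonging to one of the selected groups.
ret_df = tot_groups[tot_groups["query_col"].isin(selected)]
print(ret_df)
```

If n is larger than the number of distinct groups, this raises the same kind of ValueError, so you may want to use min(n, number of groups) as the sample size.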