ValueError: Cannot take a larger sample than population when 'replace=False' using Groupby pandas

Question

I want to randomly select 10 groups from my dataframe, but I am getting this error. How can I apply groupby before the random selection? I tried the following:

random_selection = tot_groups.groupby('query_col').apply(lambda x: x.sample(3))
random_selection = tot_groups.groupby('query_col').sample(n=10)

Error: ValueError: Cannot take a larger sample than population when 'replace=False'

Thanks!

Update:

Current dataset:

ABG23209.1,UBH04469.1,89.655,145,15,0,1,145,19,163,3.63e-100,275.0
ABG23209.1,UBH04470.1,89.655,145,15,0,1,145,20,164,4.68e-100,275.0
ABG23209.1,UBH04471.1,89.655,145,15,0,1,145,19,163,4.83e-100,275.0
ABG23209.1,UBH04472.1,89.655,145,15,0,1,145,24,168,5.58e-100,275.0
KOX89835.1,SFN69046.1,79.07,86,18,0,1,86,12,97,1.36e-49,143.0
KOX89835.1,SFE98714.1,77.907,86,19,0,1,86,19,104,2.1400000000000002e-49,143.0
KOX89835.1,WP_086938959.1,76.471,85,20,0,1,85,4,88,1.25e-48,140.0
KOX89835.1,WP_231794161.1,76.471,85,20,0,1,85,5,89,1.75e-48,140.0
KOX89835.1,WP_231794169.1,75.294,85,21,0,1,85,5,89,2.41e-48,140.0
WP_001287378.1,QBP98897.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_005164157.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_085071573.1,86.667,135,18,0,1,135,1,135,1.73e-85,241.0
WP_001287378.1,WP_014608965.1,86.765,136,17,1,1,135,1,136,2.49e-85,240.0
WP_001287378.1,WP_004932170.1,86.667,135,18,0,1,135,1,135,6.88e-78,221.0
WP_001287378.1,GGD19357.1,91.912,136,10,1,1,136,1,135,1.01e-77,221.0
WP_001287378.1,OMQ27200.1,85.926,135,19,0,1,135,1,135,1.79e-77,221.0
XP_037955766.1,WP_229689219.1,93.583,374,24,0,3,376,5,378,0.0,745.0
XP_037955766.1,WP_229799179.1,93.583,374,24,0,3,376,1,374,0.0,744.0
XP_037955766.1,WP_017454560.1,92.308,377,28,1,1,376,1,377,0.0,738.0
XP_037955766.1,WP_108127780.1,92.838,377,26,1,1,376,1,377,0.0,736.0

Desired output: randomly select n groups from the dataframe, grouped by query_col, keeping every row of each selected group. For example, with n=2:

WP_001287378.1,QBP98897.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_005164157.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_085071573.1,86.667,135,18,0,1,135,1,135,1.73e-85,241.0
WP_001287378.1,WP_014608965.1,86.765,136,17,1,1,135,1,136,2.49e-85,240.0
WP_001287378.1,WP_004932170.1,86.667,135,18,0,1,135,1,135,6.88e-78,221.0
WP_001287378.1,GGD19357.1,91.912,136,10,1,1,136,1,135,1.01e-77,221.0
WP_001287378.1,OMQ27200.1,85.926,135,19,0,1,135,1,135,1.79e-77,221.0
ABG23209.1,UBH04469.1,89.655,145,15,0,1,145,19,163,3.63e-100,275.0
ABG23209.1,UBH04470.1,89.655,145,15,0,1,145,20,164,4.68e-100,275.0
ABG23209.1,UBH04471.1,89.655,145,15,0,1,145,19,163,4.83e-100,275.0
ABG23209.1,UBH04472.1,89.655,145,15,0,1,145,24,168,5.58e-100,275.0

Answer 1

groupby's sample returns n elements from each group. If any group contains fewer than n elements, you get this error.
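As a minimal sketch of that failure, the snippet below uses a small made-up frame (`df` and its values are illustrative, not the asker's data) and assumes a pandas version that provides `groupby(...).sample` (1.1 or later). It also shows one possible workaround for the case where at most n rows per group are wanted: shuffle all rows first, then take the first n rows of each group.

```python
import pandas as pd

# Illustrative data: group "a" has 3 rows, group "b" has only 1.
df = pd.DataFrame({"query_col": ["a", "a", "a", "b"], "score": [1, 2, 3, 4]})

try:
    df.groupby("query_col").sample(n=2, random_state=0)
    error_message = None
except ValueError as exc:
    # Group "b" has fewer than 2 rows, so sampling without replacement fails.
    error_message = str(exc)
print(error_message)

# Possible workaround: shuffle all rows, then keep at most 2 rows per group.
at_most_two = df.sample(frac=1, random_state=0).groupby("query_col").head(2)
print(at_most_two.sort_index())
```

Note that this still samples rows within each group; it does not select whole groups.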

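To randomly select whole groups instead, count how many groups there are, sample n numbers (without replacement) from the range [0, number of groups), and then return the rows whose group number is one of the sampled numbers.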

import random
import pandas as pd

random.seed(0)

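# Read the headerless CSV and name the first column "query_col"
tot_groups = pd.read_csv("data.csv", header=None).rename(columns={0: "query_col"})
grouped = tot_groups.groupby("query_col")  # in case you want to keep using groupby

# Pick 2 distinct group numbers from [0, ngroups)
group_selectors = random.sample(range(grouped.ngroups), k=2)
# Keep the rows whose group number was picked
ret_df = tot_groups[grouped.ngroup().isin(group_selectors)]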

print(ret_df)
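To make the group-number mechanism concrete, here is a small sketch on a made-up frame (`demo` is illustrative only). It shows what `ngroups` and `ngroup()` return and how `isin` then filters on those numbers; with the default sorted grouping, the keys are numbered in sorted order.

```python
import pandas as pd

# Illustrative data with three distinct keys.
demo = pd.DataFrame({"query_col": ["b", "a", "b", "c"]})
grouped_demo = demo.groupby("query_col")

print(grouped_demo.ngroups)            # 3
print(grouped_demo.ngroup().tolist())  # [1, 0, 1, 2]

# Keep the rows whose group number is 1 or 2, i.e. groups "b" and "c".
subset = demo[grouped_demo.ngroup().isin([1, 2])]
print(subset["query_col"].tolist())    # ['b', 'b', 'c']
```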

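However, there is no need to create a groupby object at all. You can collect the list of distinct query_col values, sample from it, and then return the rows whose query_col is one of the sampled values: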

import random
import pandas as pd

random.seed(0)

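# Read the headerless CSV and name the first column "query_col"
tot_groups = pd.read_csv("data.csv", header=None).rename(columns={0: "query_col"})
# Distinct query values, one entry per group
unique_queries = tot_groups["query_col"].unique().tolist()
selected_queries = random.sample(unique_queries, k=2)

# Keep every row that belongs to one of the selected queries
ret_df = tot_groups[tot_groups["query_col"].isin(selected_queries)]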

print(ret_df)
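One caveat: `random.sample` itself raises a ValueError when asked for more items than the list holds, so requesting 10 groups from a frame with fewer than 10 distinct queries would fail again. The sketch below wraps the approach in a hypothetical helper, `sample_groups` (the name, the `seed` parameter, and the choice to cap n are assumptions for illustration, not part of pandas), and runs it on made-up data.

```python
import random

import pandas as pd


def sample_groups(df, column, n, seed=None):
    """Return all rows of n randomly chosen groups of `column` (hypothetical helper)."""
    keys = df[column].unique().tolist()
    # Assumption: cap n so that asking for more groups than exist does not raise.
    n = min(n, len(keys))
    rng = random.Random(seed)
    chosen = rng.sample(keys, k=n)
    return df[df[column].isin(chosen)]


# Illustrative data with three groups of different sizes.
df = pd.DataFrame(
    {"query_col": ["a", "a", "b", "c", "c", "c"], "score": [1, 2, 3, 4, 5, 6]}
)
two_groups = sample_groups(df, "query_col", n=2, seed=0)
all_groups = sample_groups(df, "query_col", n=10, seed=0)
print(two_groups)
```

Capping n silently is only one option; raising a clearer error may be preferable when the caller must get exactly n groups.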

Tags: python, random, dataframe, pandas-groupby