Select rows from a dataframe based on the unique values of another column?

Question

One column of my dataframe has the following value distribution:

air_voice_no_null.loc[:,"host_has_profile_pic"].value_counts(normalize = True)*100

1.0    99.694276
0.0     0.305724
Name: host_has_profile_pic, dtype: float64

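That is roughly a 99:1 ratio between the two unique values in that column.

I now want to create a new dataframe, drawn from this one, in which 60% of the rows have 1.0 and 40% have 0.0, keeping all the columns (with fewer rows, of course).

I tried splitting it with the stratify parameter of the train_test_split function from sklearn.model_selection, as shown below, but had no luck getting a dataframe with the desired proportion of each unique value.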

from sklearn.model_selection import train_test_split

profile_train_x, profile_test_x, profile_train_y, profile_test_y = train_test_split(
    air_voice_no_null.loc[:, ['log_price', 'accommodates', 'bathrooms', 'host_response_rate',
                              'number_of_reviews', 'review_scores_rating', 'bedrooms', 'beds',
                              'cleaning_fee', 'instant_bookable']],
    air_voice_no_null.loc[:, "host_has_profile_pic"],
    random_state=42, stratify=air_voice_no_null.loc[:, "host_has_profile_pic"])

This is the result of the code above; the total number of rows has not been reduced.

print(profile_train_x.shape)
print(profile_test_x.shape)
print(profile_train_y.shape)
print(profile_test_y.shape)

(55442, 10)
(18481, 10)
(55442,)
(18481,)

How can I select a subset of my dataset with fewer rows while getting the desired proportion of each class of the host_has_profile_pic variable?

Link to the full dataset: https://www.kaggle.com/stevezhenghp/airbnb-price-prediction

Answer 1

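The stratify parameter of train_test_split preserves the original class ratio in both splits, so it cannot turn a 99:1 mix into 60:40. Instead, consider sampling each class separately: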

import pandas as pd

# create some data
df = pd.DataFrame({'a': [0] * 10 + [1] * 90})

print('original proportion:')
print(df['a'].value_counts(normalize=True))

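# take samples for every unique value separately;
# the fractions are chosen so the sampled counts come out 40:60:
# 0.4 * 10 = 4 rows of 0, and 0.07 * 90 = 6.3, which sample() rounds to 6 rows of 1
df_new = pd.concat([
    df[df['a'] == 0].sample(frac=.4),
    df[df['a'] == 1].sample(frac=.07)])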

print('\nsample proportion:')
print(df_new['a'].value_counts(normalize=True))

Output:

original proportion:
1    0.9
0    0.1
Name: a, dtype: float64

sample proportion:
1    0.6
0    0.4
Name: a, dtype: float64
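
The fractions above were picked by hand for this toy data. As a rough sketch of how the same idea could be generalized, the helper below works out the per-class sample sizes from a target ratio and keeps as many rows as the scarcest class allows. The function name sample_to_ratio and its parameters are made up for this illustration; they are not part of pandas or scikit-learn.

```python
import pandas as pd


def sample_to_ratio(df, column, target, random_state=None):
    """Downsample df so the values in `column` follow the shares in `target`.

    `target` maps each class value to its desired share (shares should sum to 1).
    Hypothetical helper written for this example only.
    """
    counts = df[column].value_counts()
    # Largest total size such that total * share fits into every class's row count.
    total = min(int(counts.loc[value] / share) for value, share in target.items())
    parts = [
        df[df[column] == value].sample(
            n=int(round(total * share)), random_state=random_state
        )
        for value, share in target.items()
    ]
    # Shuffle so rows of the same class are not grouped together.
    return pd.concat(parts).sample(frac=1, random_state=random_state)


# Same toy data as above: 10 rows of 0 and 90 rows of 1.
df = pd.DataFrame({'a': [0] * 10 + [1] * 90})
df_new = sample_to_ratio(df, 'a', {1: 0.6, 0: 0.4}, random_state=42)
print(df_new['a'].value_counts(normalize=True))
```

On the toy data this keeps all 10 rows of 0 and draws 15 rows of 1, so the result has 25 rows instead of 10. Assuming the column holds the floats 1.0 and 0.0 as in the question, the call would presumably look like sample_to_ratio(air_voice_no_null, "host_has_profile_pic", {1.0: 0.6, 0.0: 0.4}).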
