Sampling from static data set to create dataframe, ignore index in Python

Question

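I am trying to create some random samples (of a given size) from a static dataframe. The goal is to create multiple columns, one per sample (and every sample drawn has the same size). I expected to see multiple columns of the same length (i.e. the sample size) in the fully sampled dataframe, but maybe append is not the right way to go. Here is the code: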

    import numpy as np
    import pandas as pd

    # create sample dataframe
    target_df = pd.DataFrame(np.arange(1000))
    target_df.columns = ['pl']

    # create the sampler:
    sample_num = 5
    sample_len = 10
    df_max_row = len(target_df) - sample_len

    sampled_df = pd.DataFrame()

    for i in range(sample_num):
        rndm_start = np.random.choice(df_max_row, 1)[0]
        rndm_end = rndm_start + sample_len
        slicer = target_df.iloc[rndm_start:rndm_end]['pl']
        # DataFrame.append was removed in pandas 2.0; this line runs only on older versions
        sampled_df = sampled_df.append(slicer, ignore_index=True)

    sampled_df = sampled_df.T

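The output is shown in the image below; the red lines mark the index that I want to remove.

The desired output looks like the following. How can I do this?

Thanks!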

Answer 1

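I would create each new column with

sampled_df[i] = slicer.reset_index(drop=True)

Optionally I would use str(i) as the column name, because selecting a column by string later is easier than selecting it by number.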

import pandas as pd
import random

target_df = pd.DataFrame({'pl': range(1000)})

# create the sampler:

sample_num = 5
sample_len = 10
df_max_row = len(target_df) - sample_len 

sampled_df = pd.DataFrame()

for i in range(1, sample_num+1):
    start = random.randint(0, df_max_row)
    end   = start + sample_len
    slicer = target_df[start:end]['pl']
    sampled_df[str(i)] = slicer.reset_index(drop=True)

sampled_df.index += 1 
print(sampled_df)

Result:

      1    2    3    4    5
1   735  396  646  534  769
2   736  397  647  535  770
3   737  398  648  536  771
4   738  399  649  537  772
5   739  400  650  538  773
6   740  401  651  539  774
7   741  402  652  540  775
8   742  403  653  541  776
9   743  404  654  542  777
10  744  405  655  543  778
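
To see why reset_index(drop=True) matters here, the following is a minimal sketch (the names first, second, misaligned and aligned are made up for illustration). It assumes the original problem comes from pandas aligning on index labels: each slice keeps the row labels it had in target_df, so two slices taken from different places share no labels.

```python
import pandas as pd

target_df = pd.DataFrame({'pl': range(1000)})

# Two slices of the same length but with different index labels
first = target_df['pl'].iloc[100:103]   # labels 100, 101, 102
second = target_df['pl'].iloc[500:503]  # labels 500, 501, 502

# Without resetting the index, pandas aligns on labels when assigning a column
misaligned = pd.DataFrame()
misaligned['1'] = first
misaligned['2'] = second  # shares no labels with the frame's index, so it becomes all NaN

# With reset_index(drop=True) both slices are relabelled 0, 1, 2 and line up
aligned = pd.DataFrame()
aligned['1'] = first.reset_index(drop=True)
aligned['2'] = second.reset_index(drop=True)

print(misaligned)
print(aligned)
```

drop=True discards the old labels instead of keeping them as an extra column, which is what removes the unwanted index from the question.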

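But to get truly random values, I would shuffle the values first

np.random.shuffle(target_df['pl'])

and then I don't have to use random to select start.

shuffle changes the original column in place, so its result can't be assigned to a new variable.

It also doesn't repeat values across the samples.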

import pandas as pd
#import numpy as np
import random

target_df = pd.DataFrame({'pl': range(1000)})

# create the sampler:

sample_num = 5
sample_len = 10

sampled_df = pd.DataFrame()

#np.random.shuffle(target_df['pl'])
random.shuffle(target_df['pl'])

for i in range(1, sample_num+1):
    start = i * sample_len
    end   = start + sample_len
    slicer = target_df[start:end]['pl']
    sampled_df[str(i)] = slicer.reset_index(drop=True)

sampled_df.index += 1 
print(sampled_df)

Result:

      1    2    3    4    5
1   638  331  171  989  170
2    22  643   47  136  764
3   969  455  211  763  194
4   859  384  174  552  566
5   221  829   62  926  414
6     4  895  951  967  381
7   758  688  594  876  873
8   757  691  825  693  707
9   235  353   34  699  121
10  447   81   36  682  251
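
One caveat: both random.shuffle and np.random.shuffle rely on mutating target_df['pl'] in place. Newer NumPy versions may warn when asked to shuffle a Series, and with pandas Copy-on-Write enabled the change might not reach target_df at all. A sketch of an alternative that avoids in-place mutation, assuming it is acceptable to keep the shuffled values in a separate Series (shuffled and chunk are names chosen here), could look like this:

```python
import pandas as pd

target_df = pd.DataFrame({'pl': range(1000)})

sample_num = 5
sample_len = 10

# sample(frac=1) returns every row in random order; reset_index drops the old labels
shuffled = target_df['pl'].sample(frac=1).reset_index(drop=True)

sampled_df = pd.DataFrame()
for i in range(sample_num):
    start = i * sample_len
    chunk = shuffled.iloc[start:start + sample_len]
    sampled_df[str(i + 1)] = chunk.reset_index(drop=True)

sampled_df.index += 1
print(sampled_df)
```

Because every chunk comes from a different part of one shuffled Series, no value appears twice, and target_df itself stays in its original order.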

If values may repeat, then you can use

sampled_df[str(i)] = target_df['pl'].sample(n=sample_len, ignore_index=True)

import pandas as pd

target_df = pd.DataFrame({'pl': range(1000)})

# create the sampler:

sample_num = 5
sample_len = 10

sampled_df = pd.DataFrame()

for i in range(1, sample_num+1):
    sampled_df[str(i)] = target_df['pl'].sample(n=sample_len, ignore_index=True)

sampled_df.index += 1 
print(sampled_df)
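
Note that sample draws without replacement by default, so a value should not repeat inside one column but can show up in several columns. If you need the draws to be repeatable, a possible variation is to pass random_state; the sketch below assumes an arbitrary seed (the seed and columns names are not from the code above) and builds all columns in one step with pd.concat.

```python
import pandas as pd

target_df = pd.DataFrame({'pl': range(1000)})

sample_num = 5
sample_len = 10
seed = 0  # hypothetical seed, only there to make the draws repeatable

# One independent draw per column; ignore_index=True relabels each draw 0..sample_len-1
columns = {
    str(i): target_df['pl'].sample(n=sample_len, random_state=seed + i, ignore_index=True)
    for i in range(1, sample_num + 1)
}

# The dict keys become the column names
sampled_df = pd.concat(columns, axis=1)
sampled_df.index += 1
print(sampled_df)
```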

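EDIT

You can also get the shuffled values as a numpy array and use reshape, then convert it back to a DataFrame with many columns. Afterwards you can keep only some of the columns.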

import pandas as pd
import random

target_df = pd.DataFrame({'pl': range(1000)})

# create the sampler:

sample_num = 5
sample_len = 10

random.shuffle(target_df['pl'])

sampled_df = pd.DataFrame(target_df['pl'].values.reshape([sample_len,-1]))

sampled_df = sampled_df.iloc[:, 0:sample_num]

sampled_df.index += 1
print(sampled_df)
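
The same reshape idea can be sketched purely with NumPy, assuming you would rather not shuffle the column in place. Here rng, shuffled and values are illustrative names; permutation returns a shuffled copy, and only the first sample_num * sample_len values are reshaped, so each sample ends up as one column.

```python
import numpy as np
import pandas as pd

target_df = pd.DataFrame({'pl': range(1000)})

sample_num = 5
sample_len = 10

rng = np.random.default_rng()

# permutation returns a shuffled copy and leaves target_df untouched
shuffled = rng.permutation(target_df['pl'].to_numpy())

# Keep only the values needed, one row per sample, then transpose so that
# each sample becomes a column of length sample_len
values = shuffled[:sample_num * sample_len].reshape(sample_num, sample_len).T

sampled_df = pd.DataFrame(values, columns=[str(i) for i in range(1, sample_num + 1)])
sampled_df.index += 1
print(sampled_df)
```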
