Sampling rows with sample size greater than length of DataFrame

Question

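I have been asked to generate a new variable from the data in an existing one. Basically, I need to draw values at random from the original variable (using the random function) so that the result has at least 10 times as many observations as the original, and then save it as a new variable.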

This is my dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv

The variable I want to use is area.

This is my attempt, but it gives me a "module object is not callable" error:

import pandas as pd
import random as rand

dataFrame = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv")

area = dataFrame['area']

random_area = rand(area)

print(random_area)

Answer 1

You can use the sample function with replace=True (df here is the DataFrame loaded in the question):

df = df.sample(n=len(df) * 10, replace=True)

Or, to sample only the area column, use

area = df.area.sample(n=len(df) * 10, replace=True)
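To see why replace=True matters, here is a minimal sketch on a small made-up frame (the values are hypothetical stand-ins, not the real forest fires data). It draws 10 times as many values as there are rows, then shows that the same call without replace=True raises a ValueError, because pandas cannot draw more rows than exist when sampling without replacement.

```python
import pandas as pd

# Hypothetical stand-in for the forest fires data (values are made up).
df = pd.DataFrame({"area": [0.0, 0.36, 0.43, 0.47, 0.55]})

# Sampling with replacement lets n exceed the number of rows.
area_sample = df["area"].sample(n=len(df) * 10, replace=True)
print(len(area_sample))  # 50

# Without replace=True, pandas refuses to draw more rows than exist.
try:
    df["area"].sample(n=len(df) * 10)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

The sampled Series keeps the original index labels, so labels repeat; calling reset_index(drop=True) on the result gives a clean 0..n-1 index if that is what you want.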

Another option involves np.random.choice (with numpy imported as np), and looks like:

df = df.iloc[np.random.choice(len(df), len(df) * 10)]

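The idea is to generate random indices between 0 and len(df) - 1. The first argument specifies the (exclusive) upper bound, and the second (len(df) * 10) specifies the number of indices to generate. We then use the generated indices to index into df.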
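As a rough sketch of that index-generation step in isolation, the snippet below uses a tiny illustrative frame and a factor of 3 instead of 10 to keep it short. The names idx and sampled are just for illustration.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with the default 0..2 index.
df = pd.DataFrame({"A": list("aab"), "B": list("123")})

# Draw len(df) * 3 integer positions from 0..len(df)-1.
# np.random.choice samples with replacement by default.
idx = np.random.choice(len(df), len(df) * 3)
print(idx)

# Positional indexing with the repeated positions repeats the rows.
sampled = df.iloc[idx]
print(len(sampled))  # 9
```

Because the frame has a default integer index, the index labels of sampled are the same numbers as the positions in idx.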

If you only want area, this is enough:

area = df.iloc[np.random.choice(len(df), len(df) * 10), df.columns.get_loc('area')]

Index.get_loc converts the "area" label into an integer position, which is what iloc expects.
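A small sketch of how the label-to-position step fits together, again on made-up data (the column X and all values are hypothetical, chosen only so that area is not the first column):

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame; "area" sits at position 1.
df = pd.DataFrame({"X": [1, 2, 3], "area": [0.0, 0.36, 0.43]})

# Translate the column label into the integer position iloc needs.
col_pos = df.columns.get_loc("area")
print(col_pos)  # 1

# Random row positions plus a single column position give a Series.
rows = np.random.choice(len(df), len(df) * 10)
area = df.iloc[rows, col_pos]
print(type(area).__name__, len(area))  # Series 30
```

Selecting a single column position this way returns a Series named after the column, so it can be assigned directly as the new variable.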

df = pd.DataFrame({'A': list('aab'), 'B': list('123')})
df
   A  B
0  a  1
1  a  2
2  b  3

# Sample 3 times the original size
df.sample(n=len(df) * 3, replace=True)

   A  B
2  b  3
1  a  2
1  a  2
2  b  3
1  a  2
0  a  1
0  a  1
2  b  3
2  b  3

df.iloc[np.random.choice(len(df), len(df) * 3)]

   A  B
0  a  1
1  a  2
1  a  2
0  a  1
2  b  3
0  a  1
0  a  1
0  a  1
2  b  3

Tags: python, random, sample, dataframe, pandas