Generate missing blocks of data in pandas dataframe
I have a dataframe like this:
[5232 rows x 2 columns]
0 2
0
2018-02-01 00:00:00 2018-02-01 00:00:00 435.24
2018-02-01 00:30:00 2018-02-01 00:30:00 357.12
2018-02-01 01:00:00 2018-02-01 01:00:00 301.32
2018-02-01 01:30:00 2018-02-01 01:30:00 256.68
2018-02-01 02:00:00 2018-02-01 02:00:00 245.52
2018-02-01 02:30:00 2018-02-01 02:30:00 223.20
2018-02-01 03:00:00 2018-02-01 03:00:00 212.04
2018-02-01 03:30:00 2018-02-01 03:30:00 212.04
2018-02-01 04:00:00 2018-02-01 04:00:00 212.04
2018-02-01 04:30:00 2018-02-01 04:30:00 212.04
2018-02-01 05:00:00 2018-02-01 05:00:00 223.20
2018-02-01 05:30:00 2018-02-01 05:30:00 234.36
What I can currently do is replace a fraction of the values (say 10%) with NaN at random:
df_missing.loc[df_missing.sample(frac=0.1, random_state=100).index, 2] = np.nan
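What I would like to do is the same thing, but for random blocks of size x, so that, say, 10% of the data is blanked out in blocks of NaN.
For example, with a block size of 4 and a proportion of 30%, the dataframe above might look like this: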
[5232 rows x 2 columns]
0 2
0
2018-02-01 00:00:00 2018-02-01 00:00:00 435.24
2018-02-01 00:30:00 2018-02-01 00:30:00 357.12
2018-02-01 01:00:00 2018-02-01 01:00:00 NaN
2018-02-01 01:30:00 2018-02-01 01:30:00 NaN
2018-02-01 02:00:00 2018-02-01 02:00:00 NaN
2018-02-01 02:30:00 2018-02-01 02:30:00 NaN
2018-02-01 03:00:00 2018-02-01 03:00:00 212.04
2018-02-01 03:30:00 2018-02-01 03:30:00 212.04
2018-02-01 04:00:00 2018-02-01 04:00:00 212.04
2018-02-01 04:30:00 2018-02-01 04:30:00 212.04
2018-02-01 05:00:00 2018-02-01 05:00:00 223.20
2018-02-01 05:30:00 2018-02-01 05:30:00 234.36
I worked out that I can get the number of blocks with:
number_of_samples = int((df.shape[0] * proportion) / block_size)
But I don't know how to actually create the missing blocks.
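As a quick check of that formula, here is the arithmetic for the example figures quoted above (5232 rows, a proportion of 30%, a block size of 4). The variable `rows_blanked` is introduced here only for illustration. Because of the integer truncation, the blocks cover slightly less than the requested proportion.

```python
n_rows = 5232        # row count shown in the example output
proportion = 0.3     # share of rows to blank out
block_size = 4       # rows per block

# Same formula as above: whole blocks only, so the result is truncated
number_of_samples = int((n_rows * proportion) / block_size)

# Hypothetical helper value: rows actually covered by whole blocks
rows_blanked = number_of_samples * block_size

print(number_of_samples, rows_blanked, rows_blanked / n_rows)
```

With these inputs the formula gives 392 blocks covering 1568 rows, which is just under 30% of the frame.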
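I have seen another question, which is helpful, but it comes with two caveats:
- It does not modify the original dataframe with NaN values; it only returns the samples.
- There is no guarantee that the samples will not overlap (I would like to avoid overlap).
Could someone explain how to adapt that answer to address the points above (or explain a different solution)?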
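This code gets the job done in a rather inelegant way, using an if statement to check for overlap between blocks. It also uses chain from itertools with argument unpacking (*) to flatten a list of lists into a single list: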
import random
from itertools import chain

import numpy as np
import pandas as pd

# Example dataframe: 21 half-hourly timestamps with random values
# (pd.Timestamp is used because pd.datetime is not available in recent pandas)
df = pd.DataFrame({0: pd.date_range(start=pd.Timestamp(2018, 2, 1, 0, 0, 0),
                                    end=pd.Timestamp(2018, 2, 1, 10, 0, 0),
                                    freq='30min'),
                   2: np.random.randn(21)})

# Set basic parameters
proportion = 0.4
block_size = 4
number_of_samples = int((df.shape[0] * proportion) / block_size)

# This will hold all indexes to be set to NaN
block_indexes = []
i = 0

# Iterate until number of samples are found
while i < number_of_samples:
    # Choose a potential start and end (the end is one past the last row of the block)
    potential_start = random.sample(list(df.index), 1)[0]
    potential_end = potential_start + block_size
    # Flatten the list of lists
    flattened_indexes = list(chain(*block_indexes))
    # Check to make sure potential start and potential end are not already in the indexes
    if potential_start not in flattened_indexes \
            and potential_end not in flattened_indexes:
        # If they are not, append the block indexes
        block_indexes.append(list(range(potential_start, potential_end)))
        i += 1

# Flatten the list of lists
block_indexes = list(chain(*block_indexes))

# Set the blocks to nan accounting for end of dataframe
df.loc[[x for x in block_indexes if x in df.index], 2] = np.nan
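Applying this to the example dataframe sets the chosen blocks in column 2 to NaN.
I'm not sure how you want to handle blocks at the end of the dataframe, but this code ignores any indexes that fall outside the range of the dataframe index, so a block that starts near the end is cut short. I'm sure there is a more Pythonic way to write this code, and any comments would be appreciated!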
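One possible way to avoid the retry loop altogether, sketched here as an alternative rather than as a rewrite of the code above, is to cut the frame into consecutive windows of block_size rows and sample whole windows without replacement. This is a different technique: blocks always start at a multiple of block_size, so the placement is less random, but overlap is impossible by construction. The function name `nan_aligned_blocks` and its `seed` parameter are made up for this illustration.

```python
import numpy as np
import pandas as pd


def nan_aligned_blocks(df, column, proportion, block_size, seed=None):
    """Return a copy of df with whole, grid-aligned blocks of `column` set to NaN."""
    rng = np.random.default_rng(seed)
    n_windows = len(df) // block_size                  # complete windows available
    n_blocks = int(len(df) * proportion / block_size)  # same formula as in the question
    if n_blocks > n_windows:
        raise ValueError("proportion is too large for this block size")
    # Sampling window numbers without replacement guarantees distinct blocks
    chosen = rng.choice(n_windows, size=n_blocks, replace=False)
    # Expand each window number into the row positions it covers
    positions = (chosen[:, None] * block_size + np.arange(block_size)).ravel()
    out = df.copy()
    out.iloc[positions, out.columns.get_loc(column)] = np.nan
    return out


# Example frame shaped like the one above: 21 half-hourly rows
df = pd.DataFrame({
    0: pd.date_range("2018-02-01", periods=21, freq="30min"),
    2: np.random.default_rng(0).normal(size=21),
})
df_missing = nan_aligned_blocks(df, column=2, proportion=0.4, block_size=4, seed=100)
print(df_missing)
```

Because the function works on positions through iloc and returns a copy, it does not depend on the index type and leaves the original frame untouched. Two chosen windows may still sit next to each other and form one longer gap.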
@caseWestern gave a great solution; my own solution ended up somewhat along these lines:
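import random

import numpy as np

def block_sample(df_length: int, number_of_samples: int, block_size: int):
    """Generate the initial index of each block of block_size WITHOUT replacement.

    Does this by removing the candidates at positions
    indx-(block_size-1):indx+block_size from the possible values, so that
    the next start must be at least block_size away from every earlier start.

    Raises
    ------
    ValueError: In cases of more samples than possible.
    """
    full_range = list(range(df_length))
    for _ in range(number_of_samples):
        x = random.sample(full_range, 1)[0]
        indx = full_range.index(x)
        yield x
        # Clamp at 0: a negative slice start would wrap around to the end of the list
        del full_range[max(indx - (block_size - 1), 0):indx + block_size]

# df_missing, number_of_samples and block_size are defined as in the question
df_length = df_missing.shape[0]
col = df_missing.columns.get_loc(2)
try:
    for x in block_sample(df_length, number_of_samples, block_size):
        # Positional slice: the end is exclusive, so at most block_size rows are set
        df_missing.iloc[x:x + block_size, col] = np.nan
except ValueError:
    pass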
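To check that the blocks produced by either approach came out as intended, it can help to measure the lengths of the consecutive NaN runs in the column. The helper below is a minimal sketch; the name `nan_run_lengths` is made up here and is not part of pandas. A run longer than block_size means two blocks ended up adjacent, and a shorter run means a block was cut off at the end of the frame or overlapped existing missing data.

```python
import numpy as np
import pandas as pd


def nan_run_lengths(series):
    """Return the length of each consecutive run of NaN values, in order."""
    is_na = series.isna()
    # A new run id starts wherever the NaN flag differs from the previous row
    run_id = is_na.ne(is_na.shift(fill_value=False)).cumsum()
    # Keep only the NaN rows and count how many rows share each run id
    return [int(n) for n in is_na[is_na].groupby(run_id[is_na]).size()]


s = pd.Series([1.0, np.nan, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan])
print(nan_run_lengths(s))  # [2, 1, 3]
```

Applied to the result, for example as `nan_run_lengths(df_missing[2])`, every value should be a multiple of block_size if no block was truncated.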