Numpy Select Default Condition Returns Wrong Value
I have the following code:
datetime_const = datetime(2021, 3, 31)
tmp_df1['datetime2'] = pd.to_datetime(tmp_df1['datetime1'], format='%Y-%m-%d')
tmp_df1['test_col_1'] = (tmp_df1['value1'] < 0.0002) & (tmp_df1['datetime2'] < (datetime_const + pd.DateOffset(months=12)))
tmp_df1['test_col_2'] = (tmp_df1['value1'] >= 0.0002) & ((((tmp_df1['datetime2'] - datetime_const ).dt.days/365)*tmp_df1['value1']) < 0.0002)
tmp_df1['test_col_3'] = datetime_const + pd.DateOffset(months=12)
tmp_df1['test_col_4'] = datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
tmp_df1['test_col_5'] = tmp_df1['datetime2']
tmp_df1['datetime3'] = np.select(
[
(tmp_df1['value1'] < 0.0002) & (tmp_df1['datetime2'] < (datetime_const + pd.DateOffset(months=12))),
(tmp_df1['value1'] >= 0.0002) & ((((tmp_df1['datetime2'] - datetime_const ).dt.days/365)*tmp_df1['value1']) < 0.0002)
],
[
datetime_const + pd.DateOffset(months=12),
datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
],
default=tmp_df1['datetime2']
)
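datetime1 has object dtype, so I converted it to datetime64, which is the dtype datetime2 ends up with.
value1 is a float dtype column with a bunch of decimal numbers, and it does contain NaN.
I created test_col_1 through test_col_5 to check the individual conditions and choices in my np.select call; when assigned as separate df columns, they all look correct.
However, the datetime3 column assigned from np.select returns strange large numbers with object dtype, like 160000000000. I expected it to return datetime64 values: either one of the two choices or the default datetime2 column value.
Please see the sample .info output and df rows below: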
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime2 26558 non-null datetime64[ns]
1 value1 25438 non-null float64
2 test_col_1 26558 non-null bool
3 test_col_2 26558 non-null bool
4 test_col_3 26558 non-null datetime64[ns]
5 test_col_4 25438 non-null datetime64[ns]
6 test_col_5 26558 non-null datetime64[ns]
7 datetime3 26558 non-null object
dtypes: bool(2), datetime64[ns](4), float64(1), object(1)
memory usage: 1.5+ MB
datetime2 value1 test_col_1 test_col_2 test_col_3 test_col_4 test_col_5 datetime3
0 2021-06-30 0.00058 False True 2022-03-31 2021-08-05 2021-06-30 1628121600000000000
1 2022-03-31 0.00044 False False 2022-03-31 2021-09-13 2022-03-31 1648684800000000000
2 2024-06-07 0.00860 False False 2022-03-31 2021-04-08 2024-06-07 1717718400000000000
3 2021-09-30 0.00867 False False 2022-03-31 2021-04-08 2021-09-30 1632960000000000000
4 2021-08-31 0.00144 False False 2022-03-31 2021-05-21 2021-08-31 1630368000000000000
5 2021-08-31 0.00144 False False 2022-03-31 2021-05-21 2021-08-31 1630368000000000000
6 2021-04-08 0.00474 False True 2022-03-31 2021-04-15 2021-04-08 1618444800000000000
7 2023-10-01 0.11506 False False 2022-03-31 2021-04-01 2023-10-01 1696118400000000000
8 2023-09-29 0.12067 False False 2022-03-31 2021-04-01 2023-09-29 1695945600000000000
9 2021-05-31 0.02508 False False 2022-03-31 2021-04-03 2021-05-31 1622419200000000000
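I'm completely baffled by this behavior, please advise!
Thanks in advance!
When using np.select, the dates appear to be converted to their int64 epoch-time representation. A simple workaround is to convert back afterwards with astype: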
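import numpy as np
import pandas as pd
from datetime import datetime

datetime_const = datetime(2021, 3, 31)
# dummy data
tmp_df1 = pd.DataFrame([['2021-06-30', 0.00058],['2023-10-01', 0.11506 ]],
columns= ['datetime2','value1'])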
tmp_df1['datetime2'] = pd.to_datetime(tmp_df1['datetime2'], format='%Y-%m-%d')
tmp_df1['datetime3'] = np.select(
[
(tmp_df1['value1'] < 0.0002) & (tmp_df1['datetime2'] < (datetime_const + pd.DateOffset(months=12))),
(tmp_df1['value1'] >= 0.0002) & ((((tmp_df1['datetime2'] - datetime_const ).dt.days/365)*tmp_df1['value1']) < 0.0002)
],
[
datetime_const + pd.DateOffset(months=12),
datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
],
default=tmp_df1['datetime2']
).astype('datetime64[ns]') ### <--- add this
print(tmp_df1)
datetime2 value1 datetime3
0 2021-06-30 0.00058 2021-08-04
1 2023-10-01 0.11506 2023-10-01
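The large integers in datetime3 look like nanoseconds since the Unix epoch. The sketch below reproduces that cast in isolation, outside np.select, to show where such numbers can come from. The names dates, as_object and decoded are illustrative and do not appear in the question.

```python
import numpy as np
import pandas as pd

# A datetime64[ns] array, similar to the values behind the datetime2 column.
dates = np.array(['2021-06-30', '2023-10-01'], dtype='datetime64[ns]')

# Casting nanosecond-precision datetimes to object dtype yields plain Python
# ints (nanoseconds since 1970-01-01) rather than datetime objects.
as_object = dates.astype(object)
print(as_object)

# One of the integers from the question's output, decoded back to a date.
decoded = pd.Timestamp(1628121600000000000)
print(decoded)
```

The decoded value matches test_col_4 in row 0 of the question, which suggests the selection logic itself is fine and only the dtype is lost.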
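Longer explanation
I think the problem comes from your two choices: the first is a single value and the second is a Series. You can see that it works (and keeps the datetime dtype) when the first choice is also a Series.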
# dummy
tmp_df1 = pd.DataFrame([['2021-06-30', 0.00058],['2023-10-01', 0.11506 ]],
columns= ['datetime2','value1'])
tmp_df1['datetime2'] = pd.to_datetime(tmp_df1['datetime2'], format='%Y-%m-%d')
With your approach I get the long integer representation (as you did):
np.select(
...
[
datetime_const + pd.DateOffset(months=12),
datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
],...
)
# gives
array([1628035200000000000, 1696118400000000000], dtype=object)
But if I replace datetime_const in the first choice with a Series (which makes no sense for your use case):
np.select(
...
[
tmp_df1['datetime2'] + pd.DateOffset(months=12), # here replace the constant by the column datetime2 for example
datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
],
...
)
# get the good date format (wrong value of course)
array(['2021-08-04T00:00:00.000000000', '2023-10-01T00:00:00.000000000'],
dtype='datetime64[ns]')
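Building on the observation above, one way to keep the intended constant and still avoid the object dtype might be to spread the scalar over the frame's index, so that every choice is a datetime64 Series. This is only a sketch under that assumption; the names df, elapsed_years, one_year_later, conditions, choices and result are made up for illustration.

```python
from datetime import datetime

import numpy as np
import pandas as pd

datetime_const = datetime(2021, 3, 31)

# Dummy data, same values as in the example above.
df = pd.DataFrame({
    'datetime2': pd.to_datetime(['2021-06-30', '2023-10-01']),
    'value1': [0.00058, 0.11506],
})

elapsed_years = (df['datetime2'] - datetime_const).dt.days / 365

conditions = [
    (df['value1'] < 0.0002) & (df['datetime2'] < datetime_const + pd.DateOffset(months=12)),
    (df['value1'] >= 0.0002) & (elapsed_years * df['value1'] < 0.0002),
]

# Assumption: a Series built from the scalar has a datetime64 dtype, so
# np.select no longer sees a bare Timestamp among the choices.
one_year_later = pd.Series(datetime_const + pd.DateOffset(months=12), index=df.index)

choices = [
    one_year_later,
    datetime_const + pd.to_timedelta((0.0002 / df['value1'] * 365).round(), unit='D'),
]

result = np.select(conditions, choices, default=df['datetime2'])
df['datetime3'] = result
print(result.dtype)
print(df)
```

With every choice and the default sharing a datetime64 dtype, the trailing astype call should no longer be needed, although keeping it is harmless.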