How can I make Pandas convert a column which contains NaT from timedelta to datetime?

I have a pandas DataFrame in which one column has dtype timedelta64[ns], and I would like to convert it to datetime64[ns].

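The pd.to_datetime() function claims to be able to do this, and it worked in the past, but it now seems to fail. I suspect this is due to some API quirk I have overlooked. It currently fails with: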

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/pandas/core/tools/datetimes.py", line 724, in to_datetime
    cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  File "/usr/lib/python3.7/site-packages/pandas/core/tools/datetimes.py", line 152, in _maybe_cache
    cache_dates = convert_listlike(unique_dates, format)
  File "/usr/lib/python3.7/site-packages/pandas/core/tools/datetimes.py", line 363, in _convert_listlike_datetimes
    arg, _ = maybe_convert_dtype(arg, copy=False)
  File "/usr/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 1916, in maybe_convert_dtype
    raise TypeError(f"dtype {data.dtype} cannot be converted to datetime64[ns]")
TypeError: dtype timedelta64[ns] cannot be converted to datetime64[ns]

To reproduce, use the MWE below:

wget https://chymera.eu/ppb/61ebad.csv
python
import pandas as pd
df = pd.read_csv('61ebad.csv')
df['Animal_death_date'] = pd.to_timedelta(df['Animal_death_date'], errors='coerce')
df['Animal_death_date'] = pd.to_datetime(df['Animal_death_date'], errors='coerce')

The same error occurs if I use errors='ignore'. For reference, I am using Pandas 1.0.1.

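A timedelta is a duration, not a point in time, so it only becomes a datetime once it is added to a reference point. If you need to convert timedeltas to datetimes, add some start datetime: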

import pandas as pd

df = pd.read_csv('https://chymera.eu/ppb/61ebad.csv')
start = pd.to_datetime('2000-01-01')
df['Animal_death_date'] = pd.to_timedelta(df['Animal_death_date'], errors='coerce') + start
print(df['Animal_death_date'])
0                     NaT
1                     NaT
2                     NaT
3                     NaT
4                     NaT

843                   NaT
844                   NaT
845   2000-05-12 19:00:00
846   2000-05-12 19:00:00
847   2000-05-12 19:00:00
Name: Animal_death_date, Length: 848, dtype: datetime64[ns]
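
For readers who cannot download the CSV, the same idea can be tried on in-memory data. This is only a minimal sketch: the Series `raw` and its values are made up for illustration and are not taken from the original file.

```python
import pandas as pd

# Hypothetical stand-in for the CSV column: duration strings with gaps.
raw = pd.Series(["132 days 19:00:00", None, "not a duration"])

# Missing or unparseable entries are coerced to NaT.
deltas = pd.to_timedelta(raw, errors="coerce")

# Adding a Timestamp to a timedelta column yields a datetime column;
# rows that were NaT stay NaT.
start = pd.Timestamp("2000-01-01")
dates = deltas + start

print(dates)
```

The first row should come out as 2000-05-12 19:00:00, consistent with the output shown above, while the other two rows remain NaT.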

Or add a column that is filled with datetimes:

import pandas as pd

df = pd.read_csv('https://chymera.eu/ppb/61ebad.csv')
start = pd.to_datetime(df['FMRIMeasurement_date'])
df['Animal_death_date'] = pd.to_timedelta(df['Animal_death_date'], errors='coerce') + start
print(df['Animal_death_date'])
0                     NaT
1                     NaT
2                     NaT
3                     NaT
4                     NaT

843                   NaT
844                   NaT
845   2018-10-04 19:20:54
846   2018-10-04 19:20:54
847   2018-10-04 19:20:54
Name: Animal_death_date, Length: 848, dtype: datetime64[ns]
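
The per-row variant can be sketched the same way without the file. The column names `measurement_date` and `offset` and all values below are hypothetical; the point is only that a missing value on either side of the addition produces NaT in the result.

```python
import pandas as pd

# Hypothetical data: a reference datetime and a duration per row,
# each with one missing entry.
df = pd.DataFrame({
    "measurement_date": ["2018-05-25 00:20:54", "2018-05-25 00:20:54", None],
    "offset": ["132 days 19:00:00", None, "1 days 00:00:00"],
})

start = pd.to_datetime(df["measurement_date"])
offset = pd.to_timedelta(df["offset"], errors="coerce")

# Element-wise addition; NaT in either operand gives NaT.
df["death_date"] = start + offset

print(df["death_date"])
```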

Start with a small correction: your source column is also a text column, just formatted as a timedelta.

To convert the Animal_death_date column, define the following function:

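import numpy as np
import pandas as pd
def myDateConv(tt):
    # Read the leading day count as a day of year in 2020; empty cells become NaN.
    return pd.to_datetime('2020-' + tt, format='%Y-%j days %X.%f')\
        if len(tt) > 0 else np.nan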

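I assumed that your dates are from the current year, hence 2020 as the initial part of the whole date string. If they are from another year, change this prefix accordingly.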
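
One thing worth checking before relying on the %j approach: %j is a 1-based day of year, so a string like "132 days" is read as the 132nd day of the year, not as 132 days elapsed since January 1. If the column really holds elapsed durations, the two interpretations differ by one day. A small sketch, using an illustrative value of 132:

```python
import pandas as pd

# Interpretation 1: "132" is the day of the year (what %j does).
as_day_of_year = pd.to_datetime("2020-132", format="%Y-%j")

# Interpretation 2: "132 days" is a duration added to January 1.
as_offset = pd.Timestamp("2020-01-01") + pd.Timedelta(days=132)

print(as_day_of_year)  # 2020-05-11 00:00:00
print(as_offset)       # 2020-05-12 00:00:00
```

Which of the two is right depends on what the column means in the source data, which is not stated here.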

However, apply this function as early as possible, while reading the source file:

df = pd.read_csv('61ebad.csv', index_col=0, parse_dates=['Treatment_start_date',
    'Treatment_end_date', 'FMRIMeasurement_date', 'OpenFieldTestMeasurement_date',
    'ForcedSwimTestMeasurement_date', 'CageStay_start_date', 'Cage_Treatment_start_date',
    'Cage_Treatment_end_date', 'SucrosePreferenceMeasurement_date', 'reference_date'],
    converters = { 'Animal_death_date': myDateConv })

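Note the additional parameters:

  • index_col - treat the initial column as the index,
  • parse_dates - convert "normally" formatted dates to datetime,
  • converters - apply the above function to the source Animal_death_date column.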

I think this solution is simpler and more readable than converting particular columns separately.
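
To try the converter approach without the original file, here is a self-contained sketch on an in-memory CSV. The column names `measured` and `death`, the sample rows, the helper name `parse_day_of_year`, and the simplified format string (no fractional seconds) are all assumptions made for illustration.

```python
import io
import pandas as pd

def parse_day_of_year(text):
    # Converters receive the raw cell text; empty cells arrive as ''.
    if not text:
        return pd.NaT
    # Assumed layout: '<day-of-year> days HH:MM:SS', year fixed to 2020.
    return pd.to_datetime("2020-" + text, format="%Y-%j days %H:%M:%S")

# Hypothetical two-row file with one empty 'death' cell.
csv_text = """id,measured,death
0,2018-05-25 00:20:54,
1,2018-05-26 10:00:00,132 days 19:00:00
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    parse_dates=["measured"],
    converters={"death": parse_day_of_year},
)

# The converter output may be stored as object dtype, so normalize it.
df["death"] = pd.to_datetime(df["death"])

print(df.dtypes)
```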