Convert a column of data type Int64 with <NA> values to object with nan values

Question

A tutorial has this DataFrame, sequels:

              title sequel
id                        
19995        Avatar    nan
862       Toy Story    863
863     Toy Story 2  10193
597         Titanic    nan
24428  The Avengers    nan

<class 'pandas.core.frame.DataFrame'>
Index: 4803 entries, 19995 to 185567
Data columns (total 2 columns):
title     4803 non-null object
sequel    4803 non-null object
dtypes: object(2)
memory usage: 272.6+ KB

The tutorial provides a file, sequels.p. However, when I read the file in, my DataFrame differs from the one in the tutorial:

my_sequels = pd.read_pickle('data/pandas/sequels.p')
my_sequels.set_index('id', inplace=True)
my_sequels.head()
             title  sequel
id      
19995       Avatar  <NA>
862      Toy Story  863
863    Toy Story 2  10193
597        Titanic  <NA>
24428  The Avengers <NA>

my_sequels.info()
<class 'pandas.core.frame.DataFrame'>
Index: 4803 entries, 19995 to 185567
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   4803 non-null   object
 1   sequel  90 non-null     Int64 
dtypes: Int64(1), object(1)
memory usage: 117.3+ KB

My question: is there a way to manipulate my_sequels so that it matches sequels, that is, so that my_sequels['sequel'] is an object column with 4803 non-null values, where <NA> becomes nan?

Edit: the reason I want my_sequels to be the same as sequels is to avoid an error in a later step:

sequels_fin = my_sequels.merge(financials, on='id', how='left')

orig_seq = sequels_fin.merge(sequels_fin, how='inner', left_on='sequel', 
                             right_on='id', right_index=True,
                             suffixes=('_org','_seq'))

ValueError                                Traceback (most recent call last)
<ipython-input-5-7215de303684> in <module>
      3 orig_seq = sequels_fin.merge(sequels_fin, how='inner', left_on='sequel', 
      4                              right_on='id', right_index=True,
----> 5                              suffixes=('_org','_seq'))
ValueError: cannot convert to 'int64'-dtype NumPy array with missing values. Specify an appropriate 'na_value' for this dtype.

Answer 1

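I don't think you want to do that. The reason you are seeing this is that the tutorial is based on an older version of pandas than the one you are using.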

https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

You can detect missing values and operate on them just as you would expect.

arr = pd.array([1, 2, None], dtype=pd.Int64Dtype())
arr.isna()
array([False, False,  True])
arr.fillna(0)
<IntegerArray>
[1, 2, 0]
Length: 3, dtype: Int64
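
If an object column holding nan is still wanted, the following is a minimal sketch of one way to get it, not something taken from the tutorial. It assumes a small stand-in Series named s in place of my_sequels['sequel'], and uses to_numpy with na_value to choose what <NA> turns into.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for my_sequels['sequel'] (illustrative values only).
s = pd.Series(pd.array([None, 863, 10193, None], dtype="Int64"), name="sequel")

# to_numpy lets us pick the replacement for <NA>; dtype=object keeps the
# integers as Python ints instead of upcasting everything to float.
as_object = pd.Series(
    s.to_numpy(dtype=object, na_value=np.nan),
    index=s.index,
    name=s.name,
)

print(as_object.dtype)
print(as_object.isna().tolist())
```

Note that nan still counts as null, so info() on such a column would continue to report only the filled rows as non-null.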

Answer 2

First, set 'id' as the index:

sequels_fin = sequels_fin.set_index('id')

Then:

orig_seq = sequels_fin.merge(sequels_fin, how='inner', left_on='sequel', 
                             right_on='id', right_index=True,
                             suffixes=('_org','_seq'))
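
A self-contained sketch of this self-merge on toy data follows. The frame below is a made-up stand-in for sequels_fin, and the dropna/astype step is an added assumption rather than part of the answer: rows with no sequel can never match in an inner join, so dropping them and casting the key to plain int64 sidesteps the missing-value conversion entirely.

```python
import pandas as pd

# Toy stand-in for sequels_fin (illustrative values, not the tutorial's data).
sequels_fin = pd.DataFrame(
    {
        "id": [19995, 862, 863, 597],
        "title": ["Avatar", "Toy Story", "Toy Story 2", "Titanic"],
        "sequel": pd.array([None, 863, 10193, None], dtype="Int64"),
    }
).set_index("id")

# Assumption: keep only rows that have a sequel and use a plain int64 key.
has_sequel = sequels_fin.dropna(subset=["sequel"]).astype({"sequel": "int64"})

# Match each film's 'sequel' value against the 'id' index of the full table.
orig_seq = has_sequel.merge(
    sequels_fin,
    how="inner",
    left_on="sequel",
    right_index=True,
    suffixes=("_org", "_seq"),
)

print(orig_seq[["title_org", "title_seq"]])
```

On this toy data only Toy Story pairs with Toy Story 2, because 10193 is not present in the index.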

python

pickle

pandas