Why does pd.concat change the resulting datatype from int to float?

Question

I have three dataframes: timestamp (with timestamps), dataSun (with timestamps of sunrise and sunset), and dataData (with different climate data). The dataframe timestamp has datatype "int64".

timestamp.head()
       timestamp
0  1521681600000
1  1521681900000
2  1521682200000
3  1521682500000
4  1521682800000

The dataframe dataSun also has datatype "int64".

 dataSun.head()
         sunrise         sunset
0  1521696105000  1521740761000
1  1521696105000  1521740761000
2  1521696105000  1521740761000
3  1521696105000  1521740761000
4  1521696105000  1521740761000

The dataframe with the climate data, dataData, has datatype "float64".

dataData.head()
           temperature     pressure  humidity
    0     2.490000  1018.000000      99.0
    1     2.408333  1017.833333      99.0
    2     2.326667  1017.666667      99.0
    3     2.245000  1017.500000      99.0
    4     2.163333  1017.333333      99.0
    5     2.081667  1017.166667      99.0

I want to concatenate these three dataframes into one.

dataResult = pd.concat((timestamp, dataSun, dataData), axis = 1)
dataResult.head()
       timestamp       sunrise        sunset  temperature     pressure     
0  1521681600000  1.521696e+12  1.521741e+12     2.490000  1018.000000   
1  1521681900000  1.521696e+12  1.521741e+12     2.408333  1017.833333   
2  1521682200000  1.521696e+12  1.521741e+12     2.326667  1017.666667   
3  1521682500000  1.521696e+12  1.521741e+12     2.245000  1017.500000   
4  1521682800000  1.521696e+12  1.521741e+12     2.163333  1017.333333   
5  1521683100000  1.521696e+12  1.521741e+12     2.081667  1017.166667   

weatherMeasurements.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7188 entries, 0 to 7187
Data columns (total 6 columns):
timestamp      7188 non-null int64
sunrise        7176 non-null float64
sunset         7176 non-null float64
temperature    7176 non-null float64
pressure       7176 non-null float64
humidity       7176 non-null float64
dtypes: float64(5), int64(1)

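Why has pd.concat changed the datatype of the dataSun values? I tried different ways to concatenate the dataframes. For example, I concatenated only timestamp and dataSun into one dataframe, then concatenated the resulting dataframe with dataData. But the result was the same. How can I concatenate the three dataframes and preserve the datatypes?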

Answer 1

Because of this:

timestamp      7188 non-null int64
sunrise        7176 non-null float64
...

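timestamp has 7188 non-null values, while sunrise and the columns after it have 7176. It follows that 12 values in each of those columns are not non-null, meaning they are NaN.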

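Since NaN is of dtype=float, every other value in that column is automatically upcast to float, and large float numbers are generally displayed in scientific notation.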
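To see the upcast in isolation, here is a minimal sketch. The two small frames are made up for illustration (they are not the asker's data); data_sun is deliberately one row shorter so that the default outer join on the index has to fill a gap.

```python
import pandas as pd

# Hypothetical miniature frames, not the data from the question.
timestamp = pd.DataFrame(
    {"timestamp": [1521681600000, 1521681900000, 1521682200000]}
)
# One row shorter than timestamp, so index label 2 is missing here.
data_sun = pd.DataFrame({"sunrise": [1521696105000, 1521696105000]})

# axis=1 aligns on the index with an outer join by default.
result = pd.concat((timestamp, data_sun), axis=1)

# The missing row is filled with NaN, which forces sunrise to float64,
# while timestamp has no gaps and stays int64.
print(result.dtypes)
print(result["sunrise"].isna().sum())
```

Comparing the lengths and indexes of the input frames (for example with len() and .index) is a quick way to check whether this is what is happening in your own data.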

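That's the reason, but it doesn't really solve your problem. Your options at this point are:

1. Use dropna to drop the rows that contain NaN
2. Use fillna to replace the NaN values with some other value

(Now you can downcast these columns to int.)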
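A rough sketch of both options, again on made-up miniature frames rather than the asker's data. The fill value 0 is an arbitrary placeholder chosen for the example; pick whatever sentinel makes sense for your data.

```python
import pandas as pd

# Hypothetical miniature frames, not the data from the question.
timestamp = pd.DataFrame(
    {"timestamp": [1521681600000, 1521681900000, 1521682200000]}
)
data_sun = pd.DataFrame({"sunrise": [1521696105000, 1521696105000]})
result = pd.concat((timestamp, data_sun), axis=1)  # sunrise is float64 here

# Option 1: drop the rows containing NaN, then cast the column back to int64.
dropped = result.dropna().astype({"sunrise": "int64"})

# Option 2: replace NaN with a placeholder (0 is an arbitrary choice),
# then cast the column back to int64.
filled = result.fillna({"sunrise": 0}).astype({"sunrise": "int64"})

print(dropped.dtypes)
print(filled.dtypes)
```

Note that option 1 shortens the frame, while option 2 keeps every row but leaves placeholder values that you have to remember are not real measurements.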

Alternatively, if you perform pd.concat with join='inner', no NaN values are introduced and the dtypes are preserved.

pd.concat((timestamp, dataSun, dataData), axis=1, join='inner')

       timestamp        sunrise         sunset  temperature     pressure  \    
0  1521681600000  1521696105000  1521740761000     2.490000  1018.000000   
1  1521681900000  1521696105000  1521740761000     2.408333  1017.833333   
2  1521682200000  1521696105000  1521740761000     2.326667  1017.666667   
3  1521682500000  1521696105000  1521740761000     2.245000  1017.500000   
4  1521682800000  1521696105000  1521740761000     2.163333  1017.333333   

   humidity  
0      99.0  
1      99.0  
2      99.0  
3      99.0  
4      99.0

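With this third option (join='inner'), an inner join is performed on the indexes of the dataframes, so only rows present in all three are kept.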

Answer 2

As of pandas 1.0.0, I believe you have another option, which is to first use convert_dtypes. This converts the dataframe columns to dtypes that support pd.NA, avoiding the issues with NaN discussed in Answer 1.
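A minimal sketch of that approach, assuming pandas 1.0 or later and using made-up miniature frames rather than the asker's data. After convert_dtypes, integer columns use the nullable Int64 extension dtype, so a missing row can be stored as pd.NA without forcing the column to float.

```python
import pandas as pd

# Hypothetical miniature frames, not the data from the question.
timestamp = pd.DataFrame(
    {"timestamp": [1521681600000, 1521681900000, 1521682200000]}
)
data_sun = pd.DataFrame({"sunrise": [1521696105000, 1521696105000]})

# Convert each frame to nullable dtypes first, then concatenate.
result = pd.concat(
    (timestamp.convert_dtypes(), data_sun.convert_dtypes()), axis=1
)

# sunrise is Int64 (capital I) and the missing row shows up as <NA>.
print(result.dtypes)
print(result)
```

The rows are all kept and the integer values print in full, but downstream code has to be able to handle the nullable Int64 dtype and pd.NA.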
