How to correctly convert a datatable of integers (from the Python datatable library) to a pandas DataFrame

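I am using Python datatable (https://github.com/h2oai/datatable) to read a csv file that contains only integer values. Afterwards I convert the datatable to a pandas DataFrame. During the conversion, the columns that contain only 0/1 are treated as booleans instead of integers.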

Consider the following csv file (small_csv_file_test.csv):

a1,a2,a3,a4,a5,a6,a7,a8,a9,a10
 1, 1, 1, 1, 1, 1, 1, 0, 1, 1
 2, 2, 2, 2, 2, 2, 2, 1, 0, 1
 3, 3, 3, 3, 3, 3, 3, 0, 0, 1
 4, 4, 4, 4, 4, 4, 4, 1, 0, 0
 5, 5, 5, 5, 5, 5, 5, 0, 0, 0
 6, 6, 6, 6, 6, 6, 6, 0, 0, 0
 7, 7, 7, 7, 7, 7, 7, 1, 1, 0
 8, 8, 8, 8, 8, 8, 8, 1, 1, 1
 9, 9, 9, 9, 9, 9, 9, 1, 1, 1
 0, 0, 0, 0, 0, 0, 0, 1, 0, 1

Source code:

import pandas as pd
import datatable as dt

test_csv_matrix = "small_csv_file_test.csv"

data = dt.fread(test_csv_matrix)
print(data.head(5))

matrix = data.to_pandas()
print(matrix.head())

Result:

   | a1  a2  a3  a4  a5  a6  a7  a8  a9  a10  
-- + --  --  --  --  --  --  --  --  --  ---  
 0 |  1   1   1   1   1   1   1   0   1    1  
 1 |  2   2   2   2   2   2   2   1   0    1  
 2 |  3   3   3   3   3   3   3   0   0    1  
 3 |  4   4   4   4   4   4   4   1   0    0  
 4 |  5   5   5   5   5   5   5   0   0    0  

[5 rows x 10 columns]

   a1  a2  a3  a4  a5  a6  a7     a8     a9    a10  
0   1   1   1   1   1   1   1  False   True   True  
1   2   2   2   2   2   2   2   True  False   True  
2   3   3   3   3   3   3   3  False  False   True  
3   4   4   4   4   4   4   4   True  False  False  
4   5   5   5   5   5   5   5  False  False  False  

EDIT 1: The columns a8, a9 and a10 are not correct; I want them to be integer values, not booleans.

Thanks for your help.

You can cast every column to int64:

matrix = data.to_pandas().astype('int64')
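
If only some columns come back as booleans and the others should keep their dtype, the cast can be limited to the boolean columns. The following is a minimal sketch in plain pandas; the frame `matrix` is built by hand here as a stand-in for the result of `data.to_pandas()`, and the helper name `bools_to_int64` is hypothetical.

```python
import pandas as pd


def bools_to_int64(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with every bool column cast to int64."""
    # select_dtypes picks out only the columns whose dtype is bool.
    bool_cols = frame.select_dtypes(include="bool").columns
    # astype with a dict casts the listed columns and leaves the rest alone.
    return frame.astype({col: "int64" for col in bool_cols})


# Hand-built stand-in for the output of data.to_pandas() (assumption).
matrix = pd.DataFrame({
    "a1": [1, 2, 3],
    "a8": [False, True, False],
    "a9": [True, False, False],
})

converted = bools_to_int64(matrix)
print(converted.dtypes)
```

Columns that are already integers are not touched, so any narrower integer dtypes they may have are preserved.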

You can always enforce the data types explicitly:

df = pd.DataFrame({"a1":[1,2,3,4,5,6,7,8,9,0],"a2":[1,2,3,4,5,6,7,8,9,0],"a3":[1,2,3,4,5,6,7,8,9,0],"a4":[1,2,3,4,5,6,7,8,9,0],"a5":[1,2,3,4,5,6,7,8,9,0],"a6":[1,2,3,4,5,6,7,8,9,0],"a7":[1,2,3,4,5,6,7,8,9,0],"a8":[0,1,0,1,0,0,1,1,1,1],"a9":[1,0,0,0,0,0,1,1,1,0],"a10":[1,1,1,0,0,0,0,1,1,1]})
df = df.astype({c:"int64" for c in df.columns})
df.dtypes


Add this code to your snippet:

matrix = matrix.astype(int)
matrix

Output:

   a1  a2  a3  a4  a5  a6  a7  a8  a9  a10
0   1   1   1   1   1   1   1   0   1   1
1   2   2   2   2   2   2   2   1   0   1
2   3   3   3   3   3   3   3   0   0   1
3   4   4   4   4   4   4   4   1   0   0
4   5   5   5   5   5   5   5   0   0   0
5   6   6   6   6   6   6   6   0   0   0
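
One caveat with `astype(int)`: it resolves to NumPy's default integer type, which may be 32-bit on some setups (for example Windows with NumPy versions before 2.0). A small sketch, again using a hand-built frame as a stand-in for the converted datatable, that names the width explicitly and checks that the 0/1 values survive the cast:

```python
import pandas as pd

# Hand-built stand-in for the frame returned by data.to_pandas() (assumption).
matrix = pd.DataFrame({
    "a7": [1, 2, 3, 4],
    "a8": [False, True, False, True],
})

# An explicit dtype string does not depend on the platform default integer.
as_int64 = matrix.astype("int64")

# True/False become 1/0, so the original 0/1 values are recovered.
print(as_int64["a8"].tolist())
print(as_int64.dtypes)
```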

You can do it this way:

import datatable as dt
x = dt.Frame({"a": ["1", "2", "3"], "b":["20", "30", "40"]})
x.stypes
#(stype.str32, stype.str32)
x[:,:] = dt.int64
x.stypes
#(stype.int64, stype.int64)