Issue with reading text file into numpy array using pandas reader

Question

I have a huge text file. A dummy version of it, after skipping the headers, looks like this:

1444455        7        8        12 52 45 68 70

1356799        3        3        45 34 23 22 11

I want to read it into a numpy array, but np.loadtxt runs very slowly. The file is named data.txt. Right now I am using:

u=pd.read_csv('data.txt',dtype=np.float16,header=3).values

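I have played with the parameters, to no avail. If I omit dtype, I get one long string of numbers for each row of the array. When I add dtype, I get the error: invalid literal for float(). I believe the two kinds of separators used in the text file (tabs and single spaces) are also causing some confusion. How can I turn this into a numpy array of shape (2,8)?

Can anyone help? Thanks.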

Answer 1

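If the separator is whitespace (tabs or spaces) and the file has no header row, it seems you need delim_whitespace=True together with header=None in read_csv.

Then cast to float: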

u=pd.read_csv('data.txt', delim_whitespace=True, header=None).astype(float).values

print (u)
[[  1.44445500e+06   7.00000000e+00   8.00000000e+00   1.20000000e+01
    5.20000000e+01   4.50000000e+01   6.80000000e+01   7.00000000e+01]
 [  1.35679900e+06   3.00000000e+00   3.00000000e+00   4.50000000e+01
    3.40000000e+01   2.30000000e+01   2.20000000e+01   1.10000000e+01]]
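A minimal, self-contained sketch of the same idea, assuming the file mixes tabs and single spaces as described in the question. Two things here are assumptions rather than part of the answer above: the in-memory `raw` string stands in for data.txt, and `sep=r"\s+"` is used in place of `delim_whitespace=True`, which recent pandas releases (2.2 and later) mark as deprecated. If the real file has leading header lines, the `skiprows` parameter of read_csv can skip them.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for data.txt (hypothetical content): tabs and single spaces mixed.
raw = (
    "1444455\t7\t8\t12 52 45 68 70\n"
    "1356799\t3\t3\t45 34 23 22 11\n"
)

# A regex separator of one-or-more whitespace characters treats tabs and
# spaces alike; header=None says the first line is data, not column names.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

# Convert to a float64 numpy array.
u = df.to_numpy(dtype=np.float64)

print(u.shape)
print(u.dtype)
```

With the real file, replacing `io.StringIO(raw)` by `'data.txt'` should be the only change needed.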

Note that the values are then numpy.float64:

u=pd.read_csv('data.txt', delim_whitespace=True, header=None).astype(float)

print (type(u.loc[0,0]))
<class 'numpy.float64'>

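If you use dtype=np.float16 instead, the large values in the first column overflow to inf, because float16 cannot represent anything above 65504: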

u=pd.read_csv('data.txt', dtype=np.float16, delim_whitespace=True, header=None).values
print (u)
[[ inf   7.   8.  12.  52.  45.  68.  70.]
 [ inf   3.   3.  45.  34.  23.  22.  11.]]
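The inf values can be reproduced without pandas. The sketch below is only an illustration under the assumption that the first-column values look like the ones in the question; the array `values` is made up for the example. It shows the float16 limit and that float32 is already wide enough for integers of this size.

```python
import numpy as np

# Sample values taken from the question's first column, plus one small value.
values = np.array([1444455.0, 1356799.0, 70.0])

# Largest finite number float16 can hold.
f16_max = float(np.finfo(np.float16).max)
print(f16_max)

# Casting to float16 overflows for anything above that limit.
# errstate silences the overflow warning newer numpy versions emit on the cast.
with np.errstate(over="ignore"):
    as_f16 = values.astype(np.float16)
print(as_f16)

# float32 represents integers up to 2**24 exactly, so these values survive.
as_f32 = values.astype(np.float32)
print(as_f32)
```

If memory is the reason for wanting a small dtype, float32 is probably the smallest float type that is safe for this data.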

