Best way to read a huge CSV into a dataframe with a column with mixed value types

Question

I am trying to read a huge CSV file (almost 5 GB) into a pandas DataFrame. The CSV has only 3 columns, as shown below:

 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   STORE_ID  404944 non-null  int64 
 1   SIZE      404944 non-null  int64 
 2   DISTANCE  404944 non-null  object

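The problem is that the DISTANCE column should contain only int64 numbers, but somehow it contains some "null" values in the form of \N. These \N values cause my code to fail. Unfortunately, I have no control over how this CSV is built, so I cannot correct it beforehand.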

Here is a sample of the CSV:

STORE_ID,SIZE,DISTANCE
900072211,1,1000
900072212,1,1000
900072213,1,\N
900072220,5,4500

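I need this DISTANCE column to have only int64 values.

Since the CSV file is huge, I first tried to read it with the following code, assigning the data types up front: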

df = pd.read_csv("polygons.csv", dtype={"STORE_ID": int, "SIZE": int, "DISTANCE": int})

But I got this error:

TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'

How would you efficiently read this CSV into a dataframe? Is there a way to assign a dtype to the DISTANCE column while reading?

Answer 1

Passing na_values to pd.read_csv should solve your problem:

df = pd.read_csv(..., na_values=r'\N')

Output:

>>> df
    STORE_ID  SIZE  DISTANCE
0  900072211     1    1000.0
1  900072212     1    1000.0
2  900072213     1       NaN
3  900072220     5    4500.0

>>> df.dtypes
STORE_ID      int64
SIZE          int64
DISTANCE    float64
dtype: object
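
Note that DISTANCE comes back as float64 in the output above, because NaN cannot be stored in a plain int64 column. If an integer column that still marks the missing rows is acceptable, one option is pandas' nullable "Int64" extension dtype (capital I). The following is a minimal sketch, assuming a reasonably recent pandas version; the in-memory sample stands in for the real file, and "Int64" is not the same type as NumPy's int64, so downstream code that expects plain int64 may need adjusting.

```python
import io

import pandas as pd

# Small in-memory stand-in for the real file (assumption: same layout as the sample CSV).
sample = io.StringIO(
    "STORE_ID,SIZE,DISTANCE\n"
    "900072211,1,1000\n"
    "900072212,1,1000\n"
    "900072213,1,\\N\n"
    "900072220,5,4500\n"
)

# Treat \N as missing and ask for the nullable integer dtype on DISTANCE,
# so the column stays integer-typed and missing rows show up as <NA>.
df = pd.read_csv(
    sample,
    na_values=[r"\N"],
    dtype={"STORE_ID": "int64", "SIZE": "int64", "DISTANCE": "Int64"},
)

print(df.dtypes)
```

For the real file, replace the sample buffer with the path to the CSV.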

Update

You can also use converters, here mapping \N to 0 so that the column stays int64:

convert_N = lambda x: int(x) if x != r'\N' else 0
df = pd.read_csv(..., converters={'DISTANCE': convert_N})

Output:

>>> df
    STORE_ID  SIZE  DISTANCE
0  900072211     1      1000
1  900072212     1      1000
2  900072213     1         0
3  900072220     5      4500

>>> df.dtypes
STORE_ID    int64
SIZE        int64
DISTANCE    int64
dtype: object
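
A converter is a Python function called once per value, which may be slow on a file of several gigabytes. As a hedged alternative that gives the same result (0 in place of \N, plain int64), the sketch below combines na_values with a vectorized fillna and reads the file in chunks to limit peak memory. The chunk size of 2 is only for the tiny in-memory sample; a value in the hundreds of thousands or more would be a more realistic guess for a 5 GB file.

```python
import io

import pandas as pd

# Small in-memory stand-in for the real file (assumption: same layout as the sample CSV).
sample = io.StringIO(
    "STORE_ID,SIZE,DISTANCE\n"
    "900072211,1,1000\n"
    "900072212,1,1000\n"
    "900072213,1,\\N\n"
    "900072220,5,4500\n"
)

chunks = []
# chunksize makes read_csv yield DataFrames piece by piece instead of all at once.
for chunk in pd.read_csv(sample, na_values=[r"\N"], chunksize=2):
    # Replace missing distances with 0, then cast back to a plain int64 column.
    chunk["DISTANCE"] = chunk["DISTANCE"].fillna(0).astype("int64")
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
print(df.dtypes)
```

Whether 0 is a safe replacement depends on the data: if 0 is also a valid distance, the missing rows can no longer be told apart afterwards.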
