Custom pandas dtype inference

Question

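I am working with several large .csv files, each of which has many different variables, and more are likely to appear in the future.

The problem is that the way pandas infers types by default does not suit my needs. For example, numeric variables that have no value in some rows end up being interpreted as float64, even though they are meant to be integers.

I would like, for example, to prioritize Int64Dtype over float64 without having to build a huge dtypes dictionary by hand.

A dirty solution would be to read the .csv, check each variable with my own algorithm to build my own dtypes dictionary, and then either re-open the .csv with that dictionary or convert each variable.

I would like to know whether there is a simple way to use custom inference, or even just to set a different order for the dtype checks, but I have not been able to find one.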

Answer 1

Isn't the dtype parameter of pandas.read_csv what you are looking for? You can pass it a dictionary that maps column names to the types you want.
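As a minimal sketch of the dtype parameter, the example below reads a small in-memory CSV twice, once with default inference and once with one column forced to the nullable integer type. The sample data and the column names "id" and "score" are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for one of the .csv files.
csv_text = "id,score\n1,0.5\n,1.5\n3,\n"

# Default inference: the missing value forces "id" to float64.
df_default = pd.read_csv(io.StringIO(csv_text))

# Explicit dtype for one column: "Int64" (capital I) is the nullable
# integer extension dtype, so the missing value is stored as <NA>.
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"id": "Int64"})

print(df_default.dtypes)
print(df_typed.dtypes)
```

Columns that are not listed in the dictionary, such as "score" here, still go through the default inference.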

Another approach is trial and error on the conversion of the float64 columns:

for col, dtype in df.dtypes.items():
    if dtype == 'float64':
        try:
            df[col] = df[col].astype('int64')  # raises if the column contains NaN
        except ValueError:
            pass  # keep the column as float64
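One caveat with the loop above: casting to the NumPy type int64 fails on any column that contains NaN, and it silently truncates fractional values. A possible variant, shown here only as a sketch on a made-up DataFrame, casts to the nullable "Int64" dtype instead, which keeps missing values and refuses to convert columns that hold real fractions.

```python
import numpy as np
import pandas as pd

# Made-up frame: "whole" holds integers plus a missing value,
# "frac" holds genuine fractions, "name" is not numeric at all.
df = pd.DataFrame({
    "whole": [1.0, np.nan, 3.0],
    "frac": [0.5, 1.5, np.nan],
    "name": ["a", "b", "c"],
})

for col, dtype in df.dtypes.items():
    if dtype == "float64":
        try:
            # "Int64" stores NaN as <NA> and rejects values with a
            # fractional part instead of truncating them.
            df[col] = df[col].astype("Int64")
        except (TypeError, ValueError):
            pass  # leave genuinely fractional columns as float64

print(df.dtypes)
```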

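pandas also offers two options for this:

You can use the to_numeric function to do the same thing as above; with the downcast argument it converts the data to the smallest numeric type that can hold it: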

import pandas as pd

for col in df.select_dtypes(include='float64').columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')
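To make the effect of downcast concrete, here is a small sketch on two made-up Series. It suggests that to_numeric helps only when a float column has no missing values; a column with NaN stays float64.

```python
import numpy as np
import pandas as pd

whole = pd.Series([1.0, 2.0, 3.0])
with_gap = pd.Series([1.0, np.nan, 3.0])

# All values are whole numbers, so the data is downcast to the
# smallest integer type that can hold it.
whole_down = pd.to_numeric(whole, downcast="integer")

# NaN cannot be stored in a NumPy integer array, so nothing changes.
gap_down = pd.to_numeric(with_gap, downcast="integer")

print(whole_down.dtype)
print(gap_down.dtype)
```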

Likewise, you can use convert_dtypes to convert integers and floats at the same time:

for col in df.columns:
    df[col] = df[col].convert_dtypes(convert_integer=True, convert_floating=True, convert_string=False, convert_boolean=False)
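As a sketch of how this could be applied right after reading a file, the example below calls convert_dtypes on the whole DataFrame instead of column by column. The CSV content and column names are invented for illustration, and the convert_floating behaviour assumes pandas 1.2 or later.

```python
import io
import pandas as pd

# Hypothetical sample data: "id" is an integer column with a gap,
# "score" holds real fractions, "label" is text.
csv_text = "id,score,label\n1,0.5,x\n,1.5,y\n3,,z\n"

df = pd.read_csv(io.StringIO(csv_text))

# Convert numeric columns to nullable dtypes, leaving strings and
# booleans as they were inferred by read_csv.
converted = df.convert_dtypes(convert_string=False, convert_boolean=False)

print(df.dtypes)
print(converted.dtypes)
```

This keeps the inference automatic, so new variables in future files are handled without maintaining a dtypes dictionary.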

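For the curious:

The reason pandas sticks to a floating-point type is that NaN, its default missing-value marker, is itself a float, and NumPy integer arrays cannot hold it.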

Answer 2

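If you want to change the inference algorithm itself, you have to go to the place where the method is defined.

You can find that location with this code: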

import pandas
import inspect
import os
os.path.dirname(inspect.getfile(pandas.read_csv))

It will probably return something like ~/~/~/~/~/lib/site-packages/pandas/io

Open parsers.py in that directory and find this code:

from pandas import Int64Dtype  # this import has to be added

def _infer_types(self, values, na_values, try_num_bool=True):
    """
    Infer types of values, possibly casting

    Parameters
    ----------
    values : ndarray
    na_values : set
    try_num_bool : bool, default True
       try to cast values to numeric (first preference) or boolean

    Returns
    -------
    converted : ndarray
    na_count : int
    """
    na_count = 0
    if issubclass(values.dtype.type, (np.number, np.bool_)):
        mask = algorithms.isin(values, list(na_values))
        na_count = mask.sum()
        if na_count > 0:
            if is_integer_dtype(values):
                values = values.astype(Int64Dtype())  #after change 
                #values = values.astype(np.float64)     #before change
            np.putmask(values, mask, np.nan)
        return values, na_count

This may solve your problem.
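An edit inside site-packages is tied to the installed pandas version and is overwritten on upgrade. As a possible alternative that leaves the library untouched, and assuming pandas 2.0 or later, read_csv accepts a dtype_backend argument that makes the reader return nullable dtypes directly. The sample data below is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data with a gap in the integer column.
csv_text = "id,score\n1,0.5\n,1.5\n3,\n"

# Assumes pandas >= 2.0: "numpy_nullable" asks the parser to produce
# nullable extension dtypes instead of falling back to float64.
df = pd.read_csv(io.StringIO(csv_text), dtype_backend="numpy_nullable")

print(df.dtypes)
```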
