How to identify the categorical variables in 200+ numerical variables?

Question

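I have a dataset with more than 200 numerical variables (type: int). Some of these variables are actually categorical, with values such as (0,1), (0,1,2,3,4), and so on.

I need to identify these categorical variables and convert them to dummy variables. Identifying and dummifying them by hand takes a lot of time. Is there an easy way to do it?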

Answer 1

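You can either declare specific variables to be categorical yourself, or treat a variable as categorical based on how many unique values it has. For example, if a variable has only the unique values [-2, 4, 56], you can treat it as categorical.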

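import numpy as np
import pandas as pd

# `train` is the DataFrame being analysed; skip the identifier and target columns
col = [c for c in train.columns if c not in ['id', 'target']]

# Count the distinct values in each remaining column
numclasses = []
for c in col:
    numclasses.append(len(np.unique(train[c])))

# Columns with fewer distinct values than the threshold are treated as categorical
threshold = 10
categorical_variables = list(np.array(col)[np.array(numclasses) < threshold])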

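Every unique value in every variable treated as categorical will create a new column. If you do not want to end up with a large number of dummy columns later, use a small threshold.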
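To make the effect of the threshold on the number of dummy columns concrete, here is a minimal sketch. The toy DataFrame, its column names, and the threshold value are all made up for illustration; `pd.get_dummies` is one common way to create the dummy columns, not necessarily the one the answer had in mind.

```python
import pandas as pd

# Hypothetical toy data standing in for the 200+ column dataset
train = pd.DataFrame({
    "flag": [0, 1, 0, 1, 1, 0],                 # 2 unique values
    "grade": [0, 1, 2, 3, 4, 2],                # 5 unique values
    "amount": [103, 250, 371, 468, 512, 699],   # 6 unique values
})

threshold = 6  # illustrative value; tune it for your own data

# Columns with fewer unique values than the threshold count as categorical
unique_counts = train.nunique()
categorical_variables = unique_counts[unique_counts < threshold].index.tolist()

# One new column per unique value of each selected variable
dummies = pd.get_dummies(train, columns=categorical_variables)
print(categorical_variables)
print(dummies.shape)
```

Here `flag` and `grade` are encoded into 2 + 5 dummy columns while `amount` is left untouched. Lowering the threshold to 5 would leave `grade` numeric as well.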

Answer 2

Possible duplicate of "What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?"

That post has more answers, and any of them may help you. Take a look.

Answer 3

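Use the nunique() function to get the number of unique values in each column, then filter the columns. Use your best judgment to set the threshold value. Finally, convert the selected features to the categorical type.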

category_features = []
threshold = 10
for each in df.columns:
    if df[each].nunique() < threshold:
        category_features.append(each)

for each in category_features:
    df[each] = df[each].astype('category')
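The same selection can also be written without explicit loops. The sketch below assumes a small made-up DataFrame and threshold. It also shows one reason the `category` dtype is convenient here: called without a `columns` argument, `pd.get_dummies` encodes only object, string, and category columns, so the remaining numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical toy frame; names and values are for illustration only
df = pd.DataFrame({
    "flag": [0, 1, 0, 1],       # 2 unique values
    "level": [0, 2, 1, 2],      # 3 unique values
    "count": [10, 20, 30, 40],  # 4 unique values
})
threshold = 4  # illustrative value

# DataFrame.nunique() returns the unique-value count of every column at once
counts = df.nunique()
category_features = counts[counts < threshold].index.tolist()

# Cast all selected columns in a single astype call
df = df.astype({name: "category" for name in category_features})

# Only the category columns are expanded into dummy columns
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
```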

machine-learning

python-3.x

data-cleaning

data-science