Binarize a lot of features with condition

Question

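I have a Pandas DataFrame with hundreds of categorical features (encoded as numbers). I want to keep only the most frequent values in each column. I already know that only 3 or 4 values are common in each column, but I want to select them automatically. I need two approaches: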

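1) Keep only the 3 most frequent values. Note: no column has just 1, 2 or 3 unique values (each column has about 20), so that case need not be handled. If several values tie for a place, keep all of them. For example: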

#after you use value_counts() on column 1
1 35
2 23
3 10
4 9
8 8
6 8

#after you use value_counts() on column 2
0 23
2 15
1 15 #two second places
4 9
5 3
6 2

#result after you use value_counts() on column 1
1 35
2 23
3 10
others 25 #9+8+8

#result after you use value_counts() on column 2
0 23
2 15
1 15
4 9
others 5 #3+2

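2) Keep as many values in each column as needed so that the total count of the remaining values is less than the count of the last value you decided to keep. For example: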

#after you use value_counts() on column 1
1 35
2 23
3 10
4 3
8 2
6 1

#after you use value_counts() on column 2
0 23
2 15
1 9
4 8
5 3
6 2

#result after you use value_counts() on column 1
1 35
2 23
3 10
others 6 #3+2+1

#result after you use value_counts() on column 2
0 23
2 15
1 9
4 8
others 5 #3+2

Please cover both approaches. Thanks.

Answer 1

Let's try a UDF that follows your logic:

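def my_count(s):
    x = s.value_counts()                     # counts, sorted in descending order
    if len(x) > 3:
        ret = x.iloc[:3].copy()              # the 3 most frequent values
        ret.loc['other'] = x.iloc[3:].sum()  # everything else lumped together
    else:
        ret = x                              # 3 or fewer values: nothing to lump
    return ret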

df[['col1']].apply(my_count)

Output:

       col1
1        35
2        23
3        10
other     6
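
Note that my_count cuts at exactly three rows, so values tied with the third one fall into other. Below is a minimal sketch of a tie-aware variation. It assumes that "top 3" means the three highest distinct counts, which is how the question's column 2 example reads. The name my_count_ties and the label "others" are illustrative, not part of the original answer.

```python
import pandas as pd

def my_count_ties(s, n=3, other_label="others"):
    """Keep every value whose count is among the n highest distinct counts."""
    x = s.value_counts()
    distinct = x.unique()          # distinct counts, already in descending order
    if len(distinct) <= n:         # nothing left to lump together
        return x
    thr = distinct[n - 1]          # the n-th highest distinct count
    ret = x[x >= thr].copy()
    ret.loc[other_label] = x[x < thr].sum()
    return ret

# Column 2 from the question: two values tie at a count of 15
col2 = pd.Series([0] * 23 + [2] * 15 + [1] * 15 + [4] * 9 + [5] * 3 + [6] * 2)
print(my_count_ties(col2))
```

With the question's column 2 this keeps 0, 2, 1 and 4 and puts the remaining 5 rows under "others".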

Answer 2

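I will demonstrate with the 2-column data I wanted to work with. Limitation: ties that span the 2nd, 3rd and 4th places at once are not collected into the same cell by this solution. You may need to customize this behavior further for your purposes.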

Sample data

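There are 2 columns with 26 classes each. One column is categorical and the other is numeric. The sample data was chosen deliberately to show the effect of ties.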

import pandas as pd
import numpy as np

np.random.seed(2)  # reproducibility
df = pd.DataFrame(np.random.randint(65, 91, (1000, 2)), columns=["str", "num"])
df["str"] = list(map(chr, df["str"].values))

print(df)
    str  num
0     I   80
1     N   73
2     W   76
3     S   76
4     I   72
..   ..  ...
995   M   80
996   Q   70
997   P   66
998   I   87
999   F   83
[1000 rows x 2 columns]

The function

def count_top_n(df, n_top):

    # name of output columns
    def gen_cols(ls_str):
        for s in ls_str:
            yield s
            yield f"{s}_counts"

    df_count = pd.DataFrame(np.zeros((n_top+1, df.shape[1]*2), dtype=object),
                            index=range(1, n_top+2),
                            columns=list(gen_cols(df.columns.values)))  # df.shape[1] = #cols
    # process each column
    for i, col in enumerate(df):
        # count
        tmp = df[col].value_counts()
        assert len(tmp) > n_top, f"too few classes: {len(tmp)} <= {n_top} = n_top"

        # case 1: no tie at the n_top-th place
        if tmp.iat[n_top - 1] != tmp.iat[n_top]:
            # fill in classes
            df_count.iloc[:n_top, 2*i] = tmp.iloc[:n_top].index.values
            df_count.iloc[n_top, 2*i] = "(rest)"
            # fill counts (.values avoids index alignment on assignment)
            df_count.iloc[:n_top, 2*i+1] = tmp.iloc[:n_top].values
            df_count.iloc[n_top, 2*i+1] = tmp.iloc[n_top:].sum()
        
        # case 2: ties
        else:
            # new termination location
            n_top_new = (tmp >= tmp.iat[n_top]).sum()
            # fill in classes
            df_count.iloc[:n_top-1, 2*i] = tmp.iloc[:n_top-1].index.values
            df_count.iloc[n_top-1, 2*i] = list(tmp.iloc[n_top-1:n_top_new].index.values)
            df_count.iloc[n_top, 2*i] = "(rest)"
            # fill counts
            df_count.iloc[:n_top-1, 2*i+1] = tmp.iloc[:n_top-1].values
            df_count.iloc[n_top-1, 2*i+1] = list(tmp.iloc[n_top-1:n_top_new].values)
            df_count.iloc[n_top, 2*i+1] = tmp.iloc[n_top_new:].values.sum()

    return df_count

Output:

A human-readable table is produced. Note that the 2nd, 3rd and 4th places in column str are tied.

print(count_top_n(df, 3))
      str str_counts       num num_counts
1       V         52        71         51
2       Q         46        86         47
3  [B, K]   [46, 46]  [90, 67]   [46, 46]
4  (rest)        810    (rest)        810
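
If ties should always share one cell, including ties that start before the last kept place, one option is to build a row per distinct count rather than per rank. The following is only a sketch under that assumption. The function name top_n_grouped, the output column names and the sample data are hypothetical, and it handles a single column, not the whole DataFrame.

```python
import pandas as pd

def top_n_grouped(s, n_top=3):
    """One row per distinct count (ties share a row), plus a final '(rest)' row."""
    counts = s.value_counts()
    top_counts = counts.unique()[:n_top]   # highest distinct counts, descending
    rows = [(sorted(counts.index[counts == c]), int(c)) for c in top_counts]
    rows.append(("(rest)", int(counts[counts < top_counts[-1]].sum())))
    return pd.DataFrame(rows, columns=["classes", "count"],
                        index=range(1, len(rows) + 1))

# Hypothetical sample in which Q, B and K tie for 2nd place
sample = pd.Series(list("V" * 5 + "Q" * 4 + "B" * 4 + "K" * 4 + "A" * 2 + "C"))
print(top_n_grouped(sample))
```

Here the three tied classes land in the same row, at the cost of a table whose rows no longer correspond to ranks 1 to n_top.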

Answer 3

Use the following function:

def myFilter(col, maxOther=0):
    unq = col.value_counts()
    if maxOther == 0:    # Variant 1: keep the 3 MFV (ties included)
        thr = unq.unique()[:3][-1]    # 3rd highest distinct count
        otherCnt = unq[unq < thr].sum()
        rv = col[col.isin(unq[unq >= thr].index)]
    else:    # Variant 2: drop the last LFV while their total stays below maxOther
        otherCnt = 0
        nDrop = 0    # number of least frequent values to drop
        for cnt in unq.iloc[::-1]:
            if otherCnt + cnt >= maxOther:
                break
            otherCnt += cnt
            nDrop += 1
        rv = col[col.isin(unq.iloc[:unq.size - nDrop].index)]
    rv = rv.reset_index(drop=True)
    # print(f'  Trace {col.name}\nunq:\n{unq}\notherCnt: {otherCnt}')
    return rv

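I assume that the choice between the two variants:

return the 3 most frequent values (MFV),
drop the least frequent (other) values, as long as their total count stays below maxOther,

is controlled by the maxOther parameter. Its default value of 0 selects the first variant.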

So to test both variants, call:

df.apply(myFilter) for the first variant,
df.apply(myFilter, maxOther=10) for the second variant.

To see the trace printout, uncomment the print instruction in the function.
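
myFilter takes its cutoff as a parameter, whereas the second rule in the question derives the cutoff from the data. Here is a minimal sketch of that rule under one possible reading, inferred from the question's two examples: start with the 3 most frequent values and keep adding values until the total of the remaining ones is less than the count of the last kept value. The function name, the min_keep parameter and the "others" label are hypothetical, and rare values are relabelled here, not dropped.

```python
import pandas as pd

def keep_until_rest_is_small(col, min_keep=3, other_label="others"):
    """Relabel rare values as other_label (assumed reading of the second rule)."""
    counts = col.value_counts()
    k = min(min_keep, len(counts))
    # Grow the kept set while the leftover total is not yet below the last kept count
    while k < len(counts) and counts.iloc[k:].sum() >= counts.iloc[k - 1]:
        k += 1
    keep = counts.index[:k]
    # Cast to object first so the string label can sit next to numeric codes
    return col.astype(object).where(col.isin(keep), other_label)

# The two columns from the question's second example
df = pd.DataFrame({
    "col1": [1] * 35 + [2] * 23 + [3] * 10 + [4] * 3 + [8] * 2 + [6] * 1,
    "col2": [0] * 23 + [2] * 15 + [1] * 9 + [4] * 8 + [5] * 3 + [6] * 2 + [0] * 14,
})
result = df.apply(keep_until_rest_is_small)
print(result["col1"].value_counts())
```

The second column is padded with extra zeros only so that both columns have the same length; that padding is an artifact of this example and not part of the question's data.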
