Manually formatting data in Python

Question

Please help me encode my sequence dataset manually instead of using sklearn. Previously I used the sklearn library to do the encoding.



k1,    m2,   A3,  A4,   A5, P1
A1,    k2,   A7,  A9,   A9, P2
A99,   m77,  A22,  A22,   A22, P9

Answer 1

First do this for a single column, then put the code in a function, and then run that function in a for-loop over the other columns.

The first step is to get the unique values in the column:

unique = sorted(df['COL1'].unique())

Next, use a for-loop. For each val in unique you can run

df['COL1'] == val

to get a column of True/False values, which you can convert to integers to get 1/0:

(df['COL1'] == val).astype(int)

You can put it back into the original dataframe under a new name:

df['COL1' + '_' + val] = (df['COL1'] == val).astype(int)

text = '''COL1, COL2, COL3, COL4, COL5, LABELS
A1,    A2,   A3,  A4,   A5, P1
A1,    A2,   A7,  A9,   A9, P2
A99,   A77,  A22,  A22,   A22, P9
A1,    A2,   A8,  A9,   A0, P7
A1,    A2,   A8,  A90,   A9, P2
A1,    A21,  A8,  A9,   A11, P1
A11,   A2,   A81,  A9,   A9, P1'''

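import pandas as pd
import io

# raw string avoids the invalid "\s" escape warning; a regex separator needs the python engine
df = pd.read_csv(io.StringIO(text), sep=r',\s+', engine='python')

# unique values of the column, sorted so the new columns come out in a stable order
unique = sorted(df['COL1'].unique())
print(unique)

for val in unique:
    # True/False comparison converted to 1/0
    col = (df['COL1'] == val).astype(int)
    print(col)
    # col.name is still 'COL1', so the new column is named e.g. 'COL1_A1'
    df[col.name + '_' + val] = col

print(df)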

Result:

  COL1 COL2 COL3 COL4 COL5 LABELS  COL1_A1  COL1_A11  COL1_A99
0   A1   A2   A3   A4   A5     P1        1         0         0
1   A1   A2   A7   A9   A9     P2        1         0         0
2  A99  A77  A22  A22  A22     P9        0         0         1
3   A1   A2   A8   A9   A0     P7        1         0         0
4   A1   A2   A8  A90   A9     P2        1         0         0
5   A1  A21   A8   A9  A11     P1        1         0         0
6  A11   A2  A81   A9   A9     P1        0         1         0
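
As an optional cross-check (this is not part of the manual approach itself), the hand-built columns can be compared with what pandas' own get_dummies produces. A minimal sketch, assuming the same COL1 values as in the example above; the names manual and reference are only illustrative:

```python
import pandas as pd

# same COL1 values as in the example data
df = pd.DataFrame({'COL1': ['A1', 'A1', 'A99', 'A1', 'A1', 'A1', 'A11']})

# manual encoding: one 0/1 column per unique value
manual = pd.DataFrame({
    'COL1_' + val: (df['COL1'] == val).astype(int)
    for val in sorted(df['COL1'].unique())
})

# built-in encoding, used here only for comparison
reference = pd.get_dummies(df['COL1'], prefix='COL1', dtype=int)

same_columns = list(manual.columns) == list(reference.columns)
same_values = bool((manual.to_numpy() == reference.to_numpy()).all())
print(same_columns, same_values)
```

If both flags are True, the manual loop and the built-in function agree for this column.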

It still needs to drop COL1, and it needs to remember the unique values in some dictionary so that you can convert back to them later.

Now you have to put this in a function and run it for the other columns.

text = '''COL1, COL2, COL3, COL4, COL5, LABELS
A1,    A2,   A3,  A4,   A5, P1
A1,    A2,   A7,  A9,   A9, P2
A99,   A77,  A22,  A22,   A22, P9
A1,    A2,   A8,  A9,   A0, P7
A1,    A2,   A8,  A90,   A9, P2
A1,    A21,  A8,  A9,   A11, P1
A11,   A2,   A81,  A9,   A9, P1'''

import pandas as pd
import io

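def convert(df, col_name):
    """Replace df[col_name] in place with one 0/1 column per unique value; return those values."""
    unique = sorted(df[col_name].unique())

    for val in unique:
        df[col_name + '_' + val] = (df[col_name] == val).astype(int)

    # the original column is no longer needed
    df.drop(columns=col_name, inplace=True)

    return unique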

# ---

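df = pd.read_csv(io.StringIO(text), sep=r',\s+', engine='python')

# remembers the unique values of every converted column
transformations = {}

# iterate over a copy of the column names because convert() changes df.columns
for name in list(df.columns):
    if name.startswith('COL'):
        transformations[name] = convert(df, name)

print(transformations)

print(df)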

Result:

{'COL1': ['A1', 'A11', 'A99'], 
'COL2': ['A2', 'A21', 'A77'], 
'COL3': ['A22', 'A3', 'A7', 'A8', 'A81'], 
'COL4': ['A22', 'A4', 'A9', 'A90'], 
'COL5': ['A0', 'A11', 'A22', 'A5', 'A9']}

  LABELS  COL1_A1  COL1_A11  COL1_A99  ...  COL5_A11  COL5_A22  COL5_A5  COL5_A9
0     P1        1         0         0  ...         0         0        1        0
1     P2        1         0         0  ...         0         0        0        1
2     P9        0         0         1  ...         0         1        0        0
3     P7        1         0         0  ...         0         0        0        0
4     P2        1         0         0  ...         0         0        0        1
5     P1        1         0         0  ...         1         0        0        0
6     P1        0         1         0  ...         0         0        0        1

[7 rows x 21 columns]
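
The remembered unique values are what make it possible to convert back later. A minimal sketch of that reverse step, assuming the columns are named col_name + '_' + value as above; the function name revert and the small encoded frame are hypothetical and only for illustration:

```python
import pandas as pd

def revert(df, col_name, values):
    """Hypothetical inverse: rebuild df[col_name] from its 0/1 columns, in place."""
    restored = pd.Series(index=df.index, dtype=object)
    indicator_cols = []
    for val in values:
        indicator = col_name + '_' + val
        indicator_cols.append(indicator)
        # rows where this indicator is 1 get the original value back
        restored[df[indicator] == 1] = val
    df[col_name] = restored
    df.drop(columns=indicator_cols, inplace=True)

# small hand-made encoded frame (illustrative data, not the full example)
encoded = pd.DataFrame({
    'LABELS': ['P1', 'P9', 'P1'],
    'COL1_A1': [1, 0, 0],
    'COL1_A11': [0, 0, 1],
    'COL1_A99': [0, 1, 0],
})
transformations = {'COL1': ['A1', 'A11', 'A99']}

for name, values in transformations.items():
    revert(encoded, name, values)

print(encoded)
```

A row whose indicator columns are all 0 would be left as NaN in this sketch, so it only round-trips data that was encoded with the same list of values.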
