One Hot Encoding Multiple Categorical Data in a Column

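Beginner here. I want to apply one-hot encoding to my DataFrame, which has multiple categorical values in a single column. My DataFrame looks like this, although the column contains many more values, so I can't do it by hand: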

Title       column
Movie 1   Action, Fantasy
Movie 2   Fantasy, Drama
Movie 3   Action
Movie 4   Sci-Fi, Romance, Comedy
Movie 5   NA
etc.

My desired output:

 Title     Action  Fantasy  Drama  Sci-Fi  Romance  Comedy
Movie 1     1       1        0      0        0       0
Movie 2     0       1        1      0        0       0
Movie 3     1       0        0      0        0       0
Movie 4     0       0        0      1        1       1
Movie 5     0       0        0      0        0       0  
etc.

Thanks!

Consider the input data as:

import numpy as np
import pandas as pd
data = {'Title': ['Movie 1', 'Movie 2', 'Movie 3', 'Movie 4', 'Movie 5'], 
        'column': ['Action, Fantasy', 'Fantasy, Drama', 'Action', 'Sci-Fi, Romance, Comedy', np.nan]}
df = pd.DataFrame(data)
df
    Title   column
0   Movie 1 Action, Fantasy
1   Movie 2 Fantasy, Drama
2   Movie 3 Action
3   Movie 4 Sci-Fi, Romance, Comedy
4   Movie 5 NaN

This code produces the desired output:

# treat null values
df['column'] = df['column'].fillna('NA')

# separate all genres into one list, considering comma + space as separators
genre = df['column'].str.split(', ').tolist()

# flatten the list
flat_genre = [item for sublist in genre for item in sublist]

# convert to a set to make unique
set_genre = set(flat_genre)

# back to list
unique_genre = list(set_genre)

# remove NA
unique_genre.remove('NA')

# create columns by each unique genre
df = df.reindex(df.columns.tolist() + unique_genre, axis=1, fill_value=0)

# for each value inside column, update the dummy
for index, row in df.iterrows():
    for val in row.column.split(', '):
        if val != 'NA':
            df.loc[index, val] = 1

df.drop('column', axis = 1, inplace = True)    
df
    Title   Action  Fantasy Comedy  Sci-Fi  Drama   Romance
0   Movie 1 1       1       0       0       0       0
1   Movie 2 0       1       0       0       1       0
2   Movie 3 1       0       0       0       0       0
3   Movie 4 0       0       1       1       0       1
4   Movie 5 0       0       0       0       0       0

Update: I added a null value to the test data and handle it appropriately in the first line of the solution.
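As a shorter alternative to the loop above, pandas has a built-in string method, `Series.str.get_dummies`, that splits on a separator and builds the indicator columns in one call. This is only a sketch on the same sample data; it assumes the separator is always exactly comma plus space, and the names `dummies` and `encoded` are just illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as above, including one missing value.
df = pd.DataFrame({
    'Title': ['Movie 1', 'Movie 2', 'Movie 3', 'Movie 4', 'Movie 5'],
    'column': ['Action, Fantasy', 'Fantasy, Drama', 'Action',
               'Sci-Fi, Romance, Comedy', np.nan],
})

# Split each string on ', ' and build one 0/1 column per distinct genre.
# Rows where 'column' is NaN get 0 in every genre column.
dummies = df['column'].str.get_dummies(sep=', ')

# Put the titles next to the indicator columns.
encoded = pd.concat([df[['Title']], dummies], axis=1)
print(encoded)
```

With this approach there is no need to fill the missing value with a placeholder first, and the genre columns come out in sorted order.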

### Import libraries and load sample data

import numpy as np
import pandas as pd

data = {
    'Movie 1': ['Action, Fantasy'],
    'Movie 2': ['Fantasy, Drama'],
    'Movie 3': ['Action'],
    'Movie 4': ['Sci-Fi, Romance, Comedy'],
    'Movie 5': ['NA'],
}

df = pd.DataFrame.from_dict(data, orient='index')
df.rename(columns={0:'column'}, inplace=True)

At this stage our DataFrame looks like this:

           column
Movie 1    Action, Fantasy
Movie 2    Fantasy, Drama
Movie 3    Action
Movie 4    Sci-Fi, Romance, Comedy
Movie 5    NA

Now, the question we want to ask is: for a given movie, does a given genre word ("sub-string") appear in 'column'?

To do that, we first need a list of genre words:

### Join every string in every row, split the result, pull out the unique values.
genres = np.unique(', '.join(df['column']).split(', '))
### Drop 'NA'
genres = np.delete(genres, np.where(genres == 'NA'))

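Depending on how large your dataset is, this may be computationally expensive. You mentioned that you already know the unique values, so you could define the iterable 'genres' by hand instead.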
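If joining everything into one big string is a concern, here is a rough sketch of another way to collect the unique genres, using `str.split` followed by `Series.explode`. It assumes, as in the sample above, that missing values are stored as the literal string 'NA'; the names `column` and `tokens` are only for illustration.

```python
import pandas as pd

# The same genre strings as in the sample data, with 'NA' marking a missing value.
column = pd.Series(['Action, Fantasy', 'Fantasy, Drama', 'Action',
                    'Sci-Fi, Romance, Comedy', 'NA'])

# Split each string into a list, then put every genre on its own row.
tokens = column.str.split(', ').explode()

# Drop the 'NA' placeholder and keep each distinct genre once, sorted.
genres = sorted(tokens[tokens != 'NA'].unique())
print(genres)
```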

Get the one-hot vectors:

for genre in genres:
    # regex=False: match the genre name literally, not as a regular expression
    df[genre] = df['column'].str.contains(genre, regex=False).astype('int')

df.drop('column', axis=1, inplace=True)

We loop over each genre and ask whether it is present in 'column'. This returns True or False, which become 1 or 0 respectively when we cast with astype('int').

We end up with:

          Action    Comedy  Drama   Fantasy Romance Sci-Fi
Movie 1        1         0      0         1       0      0
Movie 2        0         0      1         1       0      0
Movie 3        1         0      0         0       0      0
Movie 4        0         1      0         0       1      1
Movie 5        0         0      0         0       0      0
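One caveat with the loop above: `str.contains` is a substring test, so a genre whose name is contained in another genre's name would be flagged as well. The sample data has no such overlap. The movies and the 'Live Action' genre below are made up purely to show the difference, and the exact-match variant is just one possible sketch.

```python
import pandas as pd

# Hypothetical data: 'Live Action' contains the shorter genre name 'Action'.
df = pd.DataFrame(
    {'column': ['Live Action, Drama', 'Action']},
    index=['Movie A', 'Movie B'],
)

# Substring test: 'Movie A' is flagged as 'Action' even though it is not.
substring_match = df['column'].str.contains('Action', regex=False).astype(int)

# Exact test: split into a list of genres and check list membership.
exact_match = (
    df['column'].str.split(', ').apply(lambda g: 'Action' in g).astype(int)
)

print(substring_match.tolist())  # [1, 1]
print(exact_match.tolist())      # [0, 1]
```

If your genre names never overlap like this, the simpler substring version is fine.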