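One Hot Encoding Multiple Categorical Data in a Column
Beginner here. I want to use one-hot encoding on my dataframe, which has multiple categorical values in a single column. My dataframe looks like this, although the column contains many more values, so I can't do it by hand: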
Title column
Movie 1 Action, Fantasy
Movie 2 Fantasy, Drama
Movie 3 Action
Movie 4 Sci-Fi, Romance, Comedy
Movie 5 NA
etc.
My desired output:
Title Action Fantasy Drama Sci-Fi Romance Comedy
Movie 1 1 1 0 0 0 0
Movie 2 0 1 1 0 0 0
Movie 3 1 0 0 0 0 0
Movie 4 0 0 0 1 1 1
Movie 5 0 0 0 0 0 0
etc.
Thanks!
Consider the input data as:
import numpy as np
import pandas as pd
data = {'Title': ['Movie 1', 'Movie 2', 'Movie 3', 'Movie 4', 'Movie 5'],
        'column': ['Action, Fantasy', 'Fantasy, Drama', 'Action', 'Sci-Fi, Romance, Comedy', np.nan]}
df = pd.DataFrame(data)
df
Title column
0 Movie 1 Action, Fantasy
1 Movie 2 Fantasy, Drama
2 Movie 3 Action
3 Movie 4 Sci-Fi, Romance, Comedy
4 Movie 5 NaN
This code produces the desired output:
# treat null values (assign back instead of calling inplace on a column selection)
df['column'] = df['column'].fillna('NA')
# separate all genres into one list, considering comma + space as separators
genre = df['column'].str.split(', ').tolist()
# flatten the list
flat_genre = [item for sublist in genre for item in sublist]
# convert to a set to make unique
set_genre = set(flat_genre)
# back to list (the column order follows the set, so it is arbitrary)
unique_genre = list(set_genre)
# remove NA, if the placeholder is present at all
if 'NA' in unique_genre:
    unique_genre.remove('NA')
# create columns by each unique genre
df = df.reindex(df.columns.tolist() + unique_genre, axis=1, fill_value=0)
# for each value inside column, update the dummy
for index, row in df.iterrows():
    for val in row['column'].split(', '):
        if val != 'NA':
            df.loc[index, val] = 1
df = df.drop('column', axis=1)
df
Title Action Fantasy Comedy Sci-Fi Drama Romance
0 Movie 1 1 1 0 0 0 0
1 Movie 2 0 1 0 0 1 0
2 Movie 3 1 0 0 0 0 0
3 Movie 4 0 0 1 1 0 1
4 Movie 5 0 0 0 0 0 0
Update:
I added a null value to the test data and handle it in the first line of the solution.
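The same table can probably be built without an explicit loop. The following is a minimal sketch, not the approach used in the answer above: it assumes every cell uses exactly ', ' as the separator and relies on pandas' `Series.str.get_dummies`, which splits each string and returns one 0/1 column per distinct token. The names `dummies` and `encoded` are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Title': ['Movie 1', 'Movie 2', 'Movie 3', 'Movie 4', 'Movie 5'],
    'column': ['Action, Fantasy', 'Fantasy, Drama', 'Action',
               'Sci-Fi, Romance, Comedy', np.nan],
})

# Split each cell on ', ' and build one 0/1 column per distinct genre.
# A row with a missing value ends up with zeros in every genre column.
dummies = df['column'].str.get_dummies(sep=', ')

# Put the indicator columns next to the titles; the raw text column is left out.
encoded = pd.concat([df[['Title']], dummies], axis=1)
print(encoded)
```

The genre columns come out in sorted order here, so they will not necessarily match the column order of the loop-based output.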
### Import libraries and load sample data
import numpy as np
import pandas as pd
data = {
    'Movie 1': ['Action, Fantasy'],
    'Movie 2': ['Fantasy, Drama'],
    'Movie 3': ['Action'],
    'Movie 4': ['Sci-Fi, Romance, Comedy'],
    'Movie 5': ['NA'],
}
df = pd.DataFrame.from_dict(data, orient='index')
df.rename(columns={0:'column'}, inplace=True)
At this stage our DataFrame looks like this:
column
Movie 1 Action, Fantasy
Movie 2 Fantasy, Drama
Movie 3 Action
Movie 4 Sci-Fi, Romance, Comedy
Movie 5 NA
Now, the question we want to ask is: for a given movie, does a given genre word (a "sub-string") appear in 'column'?
To answer that, we first need a list of genre words:
### Join every string in every row, split the result, pull out the unique values.
genres = np.unique(', '.join(df['column']).split(', '))
### Drop 'NA'
genres = np.delete(genres, np.where(genres == 'NA'))
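Depending on how large your dataset is, this may be computationally expensive. You mentioned that you already know the unique values, so you could define the iterable 'genres' manually instead.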
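If the labels are not known up front, one alternative way to collect them is to split each row and let pandas expand the resulting lists, rather than joining everything into one long string. This is only a sketch under the same assumptions as above (separator ', ' and a literal 'NA' placeholder); the variable `tokens` is made up for this example, and whether it is actually faster on a large dataset would need to be measured.

```python
import pandas as pd

df = pd.DataFrame(
    {'column': ['Action, Fantasy', 'Fantasy, Drama', 'Action',
                'Sci-Fi, Romance, Comedy', 'NA']},
    index=['Movie 1', 'Movie 2', 'Movie 3', 'Movie 4', 'Movie 5'],
)

# One row per (movie, genre) pair: split into lists, then expand the lists.
tokens = df['column'].str.split(', ').explode()

# Distinct genre labels in sorted order, with the 'NA' placeholder left out.
genres = sorted(set(tokens) - {'NA'})
print(genres)
```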
Getting the one-hot vectors:
for genre in genres:
    # regex=False treats the genre as plain text, not as a regular expression
    df[genre] = df['column'].str.contains(genre, regex=False).astype('int')
df.drop('column', axis=1, inplace=True)
We loop over each genre and ask whether it occurs in 'column'. This returns True or False, which become 1 or 0 respectively when we cast with astype('int').
We end up with:
Action Comedy Drama Fantasy Romance Sci-Fi
Movie 1 1 0 0 1 0 0
Movie 2 0 0 1 1 0 0
Movie 3 1 0 0 0 0 0
Movie 4 0 1 0 0 1 1
Movie 5 0 0 0 0 0 0
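One caveat with the sub-string test: a genre whose name is contained in another genre's name would be flagged for both. The sample data has no such pair, so the data below is hypothetical and chosen only to show the effect. The sketch compares the sub-string test with a whole-label test that splits on ', ' first; `by_substring` and `by_label` are names made up for this example.

```python
import pandas as pd

# Hypothetical data: 'Art' is a sub-string of 'Martial Arts'.
df = pd.DataFrame(
    {'column': ['Martial Arts, Action', 'Art, Drama']},
    index=['Movie A', 'Movie B'],
)

# Sub-string test: also flags Movie A, because 'Martial Arts' contains 'Art'.
by_substring = df['column'].str.contains('Art', regex=False).astype(int)

# Whole-label test: split first, then check membership in the list of labels.
tokens = df['column'].str.split(', ')
by_label = tokens.apply(lambda labels: 'Art' in labels).astype(int)

print(by_substring.tolist())
print(by_label.tolist())
```

If genre names never overlap like this, the simpler sub-string loop gives the same result.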