Converting single column with strings with separator ('|') to multiple columns with binary values based on the string value

Question

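I have a data frame with one million records and a single column that holds several strings combined with a delimiter as separator.

In the desired data frame I need to keep that column and add multiple columns, with the separated strings as column headers and binary values showing which of them occur in each row's combination.

This has to be combined with other features and fed to a model estimator.

A sample of the data is attached for reference.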

x.head(20)
Genres
793754  Drama|Sci-Fi
974374  Drama|Romance
950027  Horror|Sci-Fi
998553  Comedy
757593  Action|Thriller
943002  Comedy|Romance
699895  Drama|Romance
228740  Animation|Comedy|Thriller
365470  Comedy
174365  Comedy|Fantasy
827401  Drama
75922   Comedy|Drama
934548  Animation|Children's|Comedy|Musical|Romance
281451  Comedy|Sci-Fi
694344  Sci-Fi
731063  Action|Adventure
978029  Animation|Comedy
283943  Drama|Sci-Fi|Thriller
961082  Action|Adventure|Fantasy|Sci-Fi
778922  Action|Crime|Romance

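The required columns (18) were extracted as a list of the unique values across the whole data set, and should be filled with binary 0 or 1 according to each row's string.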

genre_movies=list(genre_movies.stack().unique())
genre_movies
['Drama',
 'Animation',
 "Children's",
 'Musical',
 'Romance',
 'Comedy',
 'Action',
 'Adventure',
 'Fantasy',
 'Sci-Fi',
 'War',
 'Thriller',
 'Crime',
 'Mystery',
 'Western',
 'Horror',
 'Film-Noir',
 'Documentary']

I am new to Pandas and would appreciate your help.

Answer 1

Please check whether this is what you want (I had to type the genres in by hand, so I only included 3 rows):

               Genres  Drama  Sci-Fi  Romance  Horror
793754   Drama|Sci-Fi   True    True    False   False
974374  Drama|Romance   True   False     True   False
950027  Horror|Sci-Fi  False    True    False    True

The code is:

import pandas as pd

df = pd.DataFrame(
    {'Genres': ['Drama|Sci-Fi', 'Drama|Romance', 'Horror|Sci-Fi']},
    index=[793754, 974374, 950027],
)

# Split every distinct combination and collect the genres,
# dropping duplicates while keeping first-seen order.
genre_movies = list(df.Genres.unique())
genre_movies2 = list(dict.fromkeys(
    word for segment in genre_movies for word in segment.split('|')
))

# One boolean column per genre (plain substring match, no regex).
for genre in genre_movies2:
    df[genre] = df.Genres.str.contains(genre, regex=False)
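
One caveat with the loop above: `str.contains` is a substring test. As far as I can tell none of the 18 genres in the question is a substring of another, so it should be safe here, but if that ever changes the columns would pick up false positives. A minimal sketch of an exact-token variant follows, under the assumption that 0/1 integers are wanted instead of booleans. The names `tokens` and `genres` are mine, not from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"Genres": ["Drama|Sci-Fi", "Drama|Romance", "Horror|Sci-Fi"]},
    index=[793754, 974374, 950027],
)

# Split each row into a list of genre tokens once.
tokens = df["Genres"].str.split("|")

# Unique genres in first-seen order.
genres = list(dict.fromkeys(g for row in tokens for g in row))

# One column per genre: 1 if the genre is one of the row's tokens, else 0.
for genre in genres:
    df[genre] = tokens.apply(lambda row, g=genre: int(g in row))

print(df)
```

This uses a Python-level `apply`, so on a million rows it will likely be slower than a vectorised approach; treat it as an illustration of the matching rule rather than a tuned solution.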

Method 2, suggested by @Ins_hunter, uses the .str.get_dummies() method:

df2 = df.Genres.str.get_dummies(sep='|')

         Action  Adventure  Animation  Children's  Comedy  Crime  Documentary  Drama  Fantasy  Film-Noir  Horror  Musical  Mystery  Romance  Sci-Fi  Thriller  War  Western
0             0          0          0           0       0      0            0      1        0          0       0        0        0        0       0         0    0        0
1             0          0          1           1       0      0            0      0        0          0       0        1        0        0       0         0    0        0
2             0          0          0           0       0      0            0      0        0          0       0        1        0        1       0         0    0        0
3             0          0          0           0       0      0            0      1        0          0       0        0        0        0       0         0    0        0
4             0          0          1           1       1      0            0      0        0          0       0        0        0        0       0         0    0        0
...         ...        ...        ...         ...     ...    ...          ...    ...      ...        ...     ...      ...      ...      ...     ...       ...  ...      ...
1000204       0          0          0           0       1      0            0      0        0          0       0        0        0        0       0         0    0        0
1000205       0          0          0           0       0      0            0      1        0          0       0        0        0        1       0         0    1        0
1000206       0          0          0           0       1      0            0      1        0          0       0        0        0        0       0         0    0        0
1000207       0          0          0           0       0      0            0      1        0          0       0        0        0        0       0         0    0        0
1000208       0          0          0           1       0      0            0      1        1          0       0        0        0        0       1         0    0        0

1000209 rows × 18 columns
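
To try the `get_dummies` route on a small frame, here is a minimal sketch on the three sample rows. The `reindex` step is my own addition, not part of the suggestion: it assumes you want all 18 genres from the question as columns, in that order, even when a given slice of the data does not contain some of them.

```python
import pandas as pd

df = pd.DataFrame(
    {"Genres": ["Drama|Sci-Fi", "Drama|Romance", "Horror|Sci-Fi"]},
    index=[793754, 974374, 950027],
)

# The 18 genres listed in the question.
genre_movies = [
    "Drama", "Animation", "Children's", "Musical", "Romance", "Comedy",
    "Action", "Adventure", "Fantasy", "Sci-Fi", "War", "Thriller",
    "Crime", "Mystery", "Western", "Horror", "Film-Noir", "Documentary",
]

# One 0/1 column per genre that actually occurs in the data.
dummies = df["Genres"].str.get_dummies(sep="|")

# Force the full, fixed column set; genres absent from the data become 0.
dummies = dummies.reindex(columns=genre_movies, fill_value=0)

print(dummies.shape)
```

A fixed column set is usually what an estimator needs, since training and prediction data must have the same features in the same order.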

The result can then be merged back into the original data:

df3 = pd.concat([df, df2], axis=1)
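
A small runnable sketch of that merge, assuming `df` still holds only the `Genres` column (if the loop from the first method has already added genre columns to `df`, the concat would produce duplicate column names). The cast to `int8` is an optional extra of mine to reduce memory on a million rows; it is not required.

```python
import pandas as pd

df = pd.DataFrame(
    {"Genres": ["Drama|Sci-Fi", "Drama|Romance", "Horror|Sci-Fi"]},
    index=[793754, 974374, 950027],
)

# 0/1 indicator columns, stored as 8-bit integers to save memory.
df2 = df["Genres"].str.get_dummies(sep="|").astype("int8")

# Both frames share the same index, so a column-wise concat lines rows up.
df3 = pd.concat([df, df2], axis=1)

print(df3)
```

`df.join(df2)` should give the same result here, since it also aligns on the index.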
