How to assign multiple categories based on a condition
Below are the categories. Each one has a list of words that every row is checked against:
fashion = ['bag','purse','pen']
general = ['knob','hanger','bottle','printing','tissues','book','tissue','holder','heart']
decor =['holder','decoration','candels','frame','paisley','bunting','decorations','party','candles','design','clock','sign','vintage','hanging','mirror','drawer','home','clusters','placements','willow','stickers','box']
kitchen = ['pantry','jam','cake','glass','bowl','napkins','kitchen','baking','jar','mug','cookie','bowl','placements','molds','coaster','placemats']
holiday = ['rabbit','ornament','christmas','trinket','party']
garden = ['lantern','hammok','garden','tree']
kids = ['children','doll','birdie','asstd','bank','soldiers','spaceboy','childs']
Here is my code. I check each description for keywords and assign a category to the row accordingly. I want to allow overlaps, so a row can have more than one category.
# check whether each description contains a word from one of our category lists
df['category'] = np.select(
    [
        df['description'].str.contains('|'.join(fashion)),
        df['description'].str.contains('|'.join(general)),
        df['description'].str.contains('|'.join(decor)),
        df['description'].str.contains('|'.join(kitchen)),
        df['description'].str.contains('|'.join(holiday)),
        df['description'].str.contains('|'.join(garden)),
        df['description'].str.contains('|'.join(kids))
    ],
    ['fashion','general','decor','kitchen','holiday','garden','kids'],
    'Other'
)
Current Output:
index  description          category
0      children wine glass  kids
1      candles              decor
2      christmas tree       holiday
3      bottle               general
4      soldiers             kids
5      bag                  fashion
Expected Output:
index  description          category
0      children wine glass  kids, kitchen
1      candles              decor
2      christmas tree       holiday, garden
3      bottle               general
4      soldiers             kids
5      bag                  fashion
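For context: np.select uses the first condition that is True for each row, which is why only one category ever comes back. One way to keep the str.contains idea is to build one boolean mask per category and join the names of every mask that matches. The following is a minimal sketch, not a definitive implementation. The `categories` dict and the sample rows are shortened, hypothetical stand-ins for the lists above, and the `\b` word boundaries are an added assumption so that 'pen' does not match inside 'open'.

```python
import re
import pandas as pd

# Shortened, illustrative keyword lists (assumption: a dict replaces the separate list variables)
categories = {
    'fashion': ['bag', 'purse', 'pen'],
    'kitchen': ['glass', 'bowl', 'mug'],
    'holiday': ['christmas', 'party'],
    'garden': ['tree', 'garden'],
    'kids': ['children', 'soldiers'],
}

df = pd.DataFrame({'description': ['children wine glass', 'christmas tree', 'open box']})

# One boolean column per category; \b...\b restricts matches to whole words
masks = pd.DataFrame({
    name: df['description'].str.contains(
        r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b'
    )
    for name, words in categories.items()
})

# Join the names of all categories whose mask is True; fall back to 'Other'
df['category'] = masks.apply(
    lambda row: ', '.join([name for name, hit in row.items() if hit]) or 'Other',
    axis=1,
)
print(df)
```

With this approach the categories appear in the order of the dict, not in the order the words occur in the description.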
Here is an option using apply():
df = pd.DataFrame({'description': ['children wine glass',
                                   'candles',
                                   'christmas tree',
                                   'bottle',
                                   'soldiers',
                                   'bag']})
def categorize(desc):
    lst = []
    for w in desc.split(' '):
        if w in fashion:
            lst.append('fashion')
        if w in general:
            lst.append('general')
        if w in decor:
            lst.append('decor')
        if w in kitchen:
            lst.append('kitchen')
        if w in holiday:
            lst.append('holiday')
        if w in garden:
            lst.append('garden')
        if w in kids:
            lst.append('kids')
    return ', '.join(lst)
df.apply(lambda x: categorize(x.description), axis=1)
Output:
0 kids, kitchen
1 decor
2 holiday, garden
3 general
4 kids
5 fashion
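The chain of if statements above grows with every new category, and it appends the same category twice when two words of a description fall into the same list (for example 'cake bowl'). A possible variant, sketched here under assumptions, drives the loop from a dict and skips categories already found. The `categories` dict, its shortened word sets, and the 'Other' fallback (borrowed from the question's np.select default) are illustrative choices, not part of the answer above.

```python
import pandas as pd

# Shortened, illustrative keyword sets (assumption: one dict instead of separate lists)
categories = {
    'fashion': {'bag', 'purse', 'pen'},
    'kitchen': {'cake', 'glass', 'bowl'},
    'kids': {'children', 'soldiers'},
}

def categorize(desc, categories=categories):
    """Return every category whose word set contains a word of desc."""
    found = []
    for word in desc.split():
        for name, words in categories.items():
            # Exact word match; each category is recorded only once
            if word in words and name not in found:
                found.append(name)
    return ', '.join(found) if found else 'Other'

df = pd.DataFrame({'description': ['children wine glass', 'cake bowl', 'pencil']})
df['category'] = df['description'].apply(categorize)
print(df)
```

Categories are listed in the order their words appear in the description, matching the behaviour of the function above.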
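Here is how I would do it.
The comment above each line explains what that line is doing.
Steps:
- Convert all the categories into key:value pairs, using the words in each category as the keys and the category name as the value. This lets you look up a word and map it back to its category.
- Split the description field into multiple columns using split(expand=True).
- Match each column against the dictionary. The result will be category names and NaN.
- Join everything back into a single column separated by ',' to get the final result, excluding the NaN values. Then apply pd.unique() to it to remove duplicate categories.
The seven lines of code you need are: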
dict_keys = ['fashion','general','decor','kitchen','holiday','garden','kids']
dict_cats = [fashion,general,decor,kitchen,holiday,garden,kids]
s_dict = {val:dict_keys[i] for i,cats in enumerate(dict_cats) for val in cats}
temp = df['description'].str.split(expand=True)
temp = temp.applymap(s_dict.get)
df['new_category'] = temp.apply(lambda x: ','.join(x[x.notnull()]), axis = 1)
df['new_category'] = df['new_category'].apply(lambda x: ', '.join(pd.unique(x.split(','))))
If you have more categories, just add them to dict_keys and dict_cats. Everything else stays the same.
The full code with comments starts here:
import pandas as pd
c = ['description','category']
d = [['children wine glass','kids'],
['candles','decor'],
['christmas tree','holiday'],
['bottle','general'],
['soldiers','kids'],
['bag','fashion']]
df = pd.DataFrame(d,columns = c)
fashion = ['bag','purse','pen']
general = ['knob','hanger','bottle','printing','tissues','book','tissue','holder','heart']
decor =['holder','decoration','candels','frame','paisley','bunting','decorations','party','candles','design','clock','sign','vintage','hanging','mirror','drawer','home','clusters','placements','willow','stickers','box']
kitchen = ['pantry','jam','cake','glass','bowl','napkins','kitchen','baking','jar','mug','cookie','bowl','placements','molds','coaster','placemats']
holiday = ['rabbit','ornament','christmas','trinket','party']
garden = ['lantern','hammok','garden','tree']
kids = ['children','doll','birdie','asstd','bank','soldiers','spaceboy','childs']
#create a list of all the lists
dict_keys = ['fashion','general','decor','kitchen','holiday','garden','kids']
dict_cats = [fashion,general,decor,kitchen,holiday,garden,kids]
#create a dictionary with words from the list as key and category as value
s_dict = {val:dict_keys[i] for i,cats in enumerate(dict_cats) for val in cats}
#create a temp dataframe with one word for each column using split
temp = df['description'].str.split(expand=True)
#match the words in each column against the dictionary
temp = temp.applymap(s_dict.get)
#Now put them back together and you have the final list
df['new_category'] = temp.apply(lambda x: ','.join(x[x.notnull()]), axis = 1)
#Remove duplicates using pd.unique()
#Note: prev line join modified to ',' from ', '
df['new_category'] = df['new_category'].apply(lambda x: ', '.join(pd.unique(x.split(','))))
print (df)
The output will be as follows (I kept your category column and created a new column named new_category):
   description          category  new_category
0  children wine glass  kids      kids, kitchen
1  candles              decor     decor
2  christmas tree       holiday   holiday, garden
3  bottle               general   general
4  soldiers             kids      kids
5  bag                  fashion   fashion
With 'party candles holder' included, the output is:
   description           category  new_category
0  children wine glass   kids      kids, kitchen
1  candles               decor     decor
2  christmas tree        holiday   holiday, garden
3  bottle                general   general
4  party candles holder  None      holiday, decor
5  soldiers              kids      kids
6  bag                   fashion   fashion
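One thing to be aware of: a dict holds a single value per key, so a word that appears in several lists (such as 'holder' in both general and decor, or 'party' in both decor and holiday) keeps only the last category assigned to it in s_dict. That is why 'general' is missing from the 'party candles holder' row. If every matching category is wanted, one option is to map each word to a list of categories. This is a sketch under that assumption. The names `categories`, `word_to_cats` and `categorize` are illustrative and do not come from the code above.

```python
from collections import defaultdict
import pandas as pd

# Same keyword lists as above, gathered into one dict (illustrative layout)
categories = {
    'fashion': ['bag','purse','pen'],
    'general': ['knob','hanger','bottle','printing','tissues','book','tissue','holder','heart'],
    'decor': ['holder','decoration','candels','frame','paisley','bunting','decorations','party','candles','design','clock','sign','vintage','hanging','mirror','drawer','home','clusters','placements','willow','stickers','box'],
    'kitchen': ['pantry','jam','cake','glass','bowl','napkins','kitchen','baking','jar','mug','cookie','bowl','placements','molds','coaster','placemats'],
    'holiday': ['rabbit','ornament','christmas','trinket','party'],
    'garden': ['lantern','hammok','garden','tree'],
    'kids': ['children','doll','birdie','asstd','bank','soldiers','spaceboy','childs'],
}

# word -> every category that lists it, in the order the categories are defined
word_to_cats = defaultdict(list)
for name, words in categories.items():
    for word in dict.fromkeys(words):  # dict.fromkeys drops repeated words such as 'bowl'
        word_to_cats[word].append(name)

def categorize(desc):
    # Collect the categories of every word, then drop duplicates while keeping order
    hits = [cat for word in desc.split() for cat in word_to_cats.get(word, [])]
    return ', '.join(dict.fromkeys(hits))

df = pd.DataFrame({'description': ['children wine glass', 'party candles holder', 'bag']})
df['new_category'] = df['description'].apply(categorize)
print(df)
```

Here 'party candles holder' is tagged with decor, holiday and general, because both 'party' and 'holder' contribute all of their categories.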