Search a column for a string and classify it using dictionary keys
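I have imported a spreadsheet about my contacts that I exported from LinkedIn, and I want to classify people's positions into different levels.
So I created a dictionary containing the terms used to identify each position level.
The first version of the dictionary is: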
dicpositions = {'0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'],
                '1 - Director of': ['Director', 'Head'],
                '2 - Manager': ['Manager', 'Administrador'],
                '3 - Engenheiro': ['Engenheiro', 'Engineering'],
                '4 - Consultor': ['Consultor', 'Consultant'],
                '5 - Estagiário': ['Estagiário', 'Intern'],
                '6 - Desempregado': ['Self-Employed', 'Autônomo'],
                '7 - Professor': ['Professor', 'Researcher']}
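Now I need code that reads each position in the spreadsheet, checks whether it contains any of these terms, and returns the corresponding key in another column.
A sample of the data in the DataFrame I am reading is: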
sample = pd.Series(data = (['(blank)'], ['Estagiário'], ['Professor', 'Adjunto'],
                           ['CEO', 'and', 'Founder'], ['Engenheiro', 'de', 'Produção'],
                           ['Consultant'], ['Founder', 'and', 'CTO'],
                           ['Intern'], ['Manager', 'Specialist'],
                           ['Administrador', 'de', 'Novos', 'Negócios'],
                           ['Administrador', 'de', 'Serviços']))
Which returns:
0 [(blank)]
1 [Estagiário]
2 [Professor, Adjunto]
3 [CEO, and, Founder]
4 [Engenheiro, de, Produção]
5 [Consultant]
6 [Founder, and, CTO]
7 [Intern]
8 [Manager, Specialist]
9 [Administrador, de, Novos, Negócios]
10 [Administrador, de, Serviços]
dtype: object
I got as far as the following code:
import pandas as pd

plan = pd.read_excel('SpreadSheet Name.xlsx', sheet_name = 'Positions')
list0 = ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner']
list1 = ['Director', 'Head']
list2 = ['Manager', 'Administrador']
listgeral = [list0, list1, list2]

def in_list(list_to_search, terms_to_search):
    # Collect the words of the position that appear in the list of terms
    results = [item for item in list_to_search if item in terms_to_search]
    if len(results) > 0:
        return '0 - CEO, Founder'
    else:
        pass

plan['PositionLevel'] = plan['Position'].str.split().apply(lambda x: in_list(x, listgeral[0]))
Actual output:
Position PositionLevel
0 '(blank)' None
1 'Estagiário' None
2 'Professor Adjunto' None
3 'CEO and Founder' '0 - CEO, Founder'
4 'Engenheiro de produção' None
5 'Consultant' None
6 'Founder and CTO' '0 - CEO, Founder'
7 'Intern' None
8 'Manager Specialist' None
9 'Administrador de Novos Negócios' None
Expected output:
Position PositionLevel
0 '(blank)' None
1 'Estagiário' '5 - Estagiário'
2 'Professor Adjunto' '7 - Professor'
3 'CEO and Founder' '0 - CEO, Founder'
4 'Engenheiro de produção' '3 - Engenheiro'
5 'Consultant' '4 - Consultor'
6 'Founder and CTO' '0 - CEO, Founder'
7 'Intern' '5 - Estagiário'
8 'Manager Specialist' '2 - Manager'
9 'Administrador de Novos Negócios' '2 - Manager'
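At first I planned to run my code for each list in listgeral, but I could not get that to work. Then I came to think it would be better to apply this to one big dictionary, like dicpositions at the start of the question, and return the key for the matching term.
I tried the following code for this: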
dictest = {'0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'],
           '1 - Director of': ['Director', 'Head'],
           '2 - Manager': ['Manager', 'Administrador']}

def in_dic(x, dictest):
    for key in dictest:
        for elem in dictest[key]:
            if elem == x:
                return key
    return False
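The output of in_dic('CEO', dictest) is '0 - CEO, Founder' and, for example, the output of in_dic('Banana', dictest) is False.
But I could not make progress from here and apply in_dic() to solve my problem.
Any help would be greatly appreciated.
Thank you very much.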
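I took the liberty of refactoring your input a bit, but here is what I came up with (it may be slightly over-engineered). In short, we use a library called jellyfish (pip3 install jellyfish, code taken from this answer) for fuzzy string matching, to match the positions in your Excel sheet against the terms in dicpositions, and then map them to the categories in that same dictionary. Here are the imports and the matching function: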
import pandas as pd
import numpy as np
import jellyfish

# Function for fuzzy-matching strings
def get_closest_match(x, list_strings):
    best_match = None
    highest_jw = 0
    # Keep an eye out for "blank" values, they can be strings, e.g. "(blank)", or e.g. NaN values
    if pd.isna(x) or x == "(blank)":
        return "(blank)"
    # Find which string most closely matches our input and return it
    for current_string in list_strings:
        # Named jaro_winkler in older jellyfish releases
        current_score = jellyfish.jaro_winkler_similarity(x, current_string)
        if current_score > highest_jw:
            highest_jw = current_score
            best_match = current_string
    return best_match
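One caveat of the function above is that it always returns whichever alias scores highest, even when the input resembles none of them (the in_dic('Banana', dictest) case from the question). Below is a minimal sketch of a cutoff. It is not part of the original answer: the name get_closest_match_with_cutoff and the 0.6 threshold are made up for illustration, and it uses difflib.SequenceMatcher from the standard library as a stand-in scorer instead of jellyfish, so the scores differ from Jaro-Winkler scores.

```python
from difflib import SequenceMatcher


def get_closest_match_with_cutoff(x, list_strings, cutoff=0.6):
    """Return the most similar string, or None if nothing scores >= cutoff.

    The scorer is difflib's ratio (a stand-in, not Jaro-Winkler) and the
    default cutoff is an arbitrary illustrative value.
    """
    best_match = None
    best_score = 0.0
    for candidate in list_strings:
        # Compare case-insensitively so 'manager' and 'Manager' score 1.0
        score = SequenceMatcher(None, x.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_score = score
            best_match = candidate
    # Reject the winner when even the best score is too low
    return best_match if best_score >= cutoff else None


aliases = ["CEO", "Founder", "Director", "Manager", "Consultant"]
print(get_closest_match_with_cutoff("Consultor", aliases))  # Consultant
print(get_closest_match_with_cutoff("Banana", aliases))     # None
```

The same idea would carry over to the jellyfish version by comparing highest_jw against a threshold before returning, though a sensible threshold would have to be tuned on real position titles.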
OK, here is your dicpositions, which I converted into a long-format DataFrame for convenience:
# Translations between keywords and their category, as dict, as provided in question
dicpositions = {'0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'],
                '1 - Director of': ['Director', 'Head'],
                '2 - Manager': ['Manager', 'Administrador'],
                '3 - Engenheiro': ['Engenheiro', 'Engineering'],
                '4 - Consultor': ['Consultor', 'Consultant'],
                '5 - Estagiário': ['Estagiário', 'Intern'],
                '6 - Desempregado': ['Self-Employed', 'Autônomo'],
                '7 - Professor': ['Professor', 'Researcher'],
                'Not found': ["(blank)"]  # <-- I added this to deal with blank values
                }

# Let's expand the dict above to a DF, which makes for easier merging later
positions = []
aliases = []
for key, val in dicpositions.items():
    for v in val:
        positions.append(key)
        aliases.append(v)

# This will serve as our mapping table
lookup_table = pd.DataFrame({
    "position": positions,
    "alias": aliases
})
print(lookup_table)
Instead of a dictionary, it is now a long-format DataFrame. This format makes it very easy to match categories to the various keywords later on:
position alias
0 0 - CEO, Founder CEO
1 0 - CEO, Founder Founder
2 0 - CEO, Founder Co-Founder
3 0 - CEO, Founder Cofounder
4 0 - CEO, Founder Owner
5 1 - Director of Director
6 1 - Director of Head
7 2 - Manager Manager
8 2 - Manager Administrador
9 3 - Engenheiro Engenheiro
10 3 - Engenheiro Engineering
11 4 - Consultor Consultor
12 4 - Consultor Consultant
13 5 - Estagiário Estagiário
14 5 - Estagiário Intern
15 6 - Desempregado Self-Employed
16 6 - Desempregado Autônomo
17 7 - Professor Professor
18 7 - Professor Researcher
19 Not found (blank)
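For readers who prefer to stay with plain dictionaries, the long-format table corresponds to inverting the dictionary so that every alias points at its category. The sketch below shows that inversion on a shortened copy of dicpositions; the name alias_to_position is made up for this example and does not appear in the original answer.

```python
# Shortened copy of the dictionary from the question
dicpositions = {
    '0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'],
    '1 - Director of': ['Director', 'Head'],
    '2 - Manager': ['Manager', 'Administrador'],
}

# Invert it: each alias becomes a key pointing at its category,
# one entry per row of the long-format table
alias_to_position = {
    alias: position
    for position, aliases in dicpositions.items()
    for alias in aliases
}

print(alias_to_position['Head'])  # 1 - Director of
print(len(alias_to_position))     # 9
```

Each key/value pair here plays the role of one (alias, position) row in the lookup table, so a dictionary lookup stands in for the later merge when only a single category per alias is needed.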
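Let's test some input to see how the matching works. We compare each of your input strings against the strings in the alias column and return whichever alias value most closely matches the input (we will use it again later to look up the category, or position):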
# Test input, as a list, you might have to wrangle it from your format to a list, though
test_df = pd.DataFrame({"test_position": ["(blank)", 'Estagiário', 'Professor Adjunto', 'CEO and Founder', 'Engenheiro de produção', 'Consultant', 'Founder and CTO', 'Intern', 'Manager Specialist', 'Administrador de Novos Negócios']})
# Match our test input with our mapping table, create a new column 'best_match' representing the value in our mapping table that most closely matches our input
test_df["best_match"] = test_df.test_position.map(lambda x: get_closest_match(x, lookup_table.alias))
print(test_df)
Our test_df now has a new column showing which alias in our lookup table is most similar to the test_position input:
test_position best_match
0 (blank) (blank)
1 Estagiário Estagiário
2 Professor Adjunto Professor
3 CEO and Founder CEO
4 Engenheiro de produção Engenheiro
5 Consultant Consultant
6 Founder and CTO Founder
7 Intern Intern
8 Manager Specialist Manager
9 Administrador de Novos Negócios Administrador
To finish with the categories, we simply merge the best_match column of our test data with the alias column of the lookup table:
result = test_df.merge(lookup_table, left_on="best_match", right_on="alias", how="left")
The result is:
test_position best_match alias position
0 (blank) (blank) (blank) Not found
1 Estagiário Estagiário Estagiário 5 - Estagiário
2 Professor Adjunto Professor Professor 7 - Professor
3 CEO and Founder CEO CEO 0 - CEO, Founder
4 Engenheiro de produção Engenheiro Engenheiro 3 - Engenheiro
5 Consultant Consultant Consultant 4 - Consultor
6 Founder and CTO Founder Founder 0 - CEO, Founder
7 Intern Intern Intern 5 - Estagiário
8 Manager Specialist Manager Manager 2 - Manager
9 Administrador de Novos Negócios Administrador Administrador 2 - Manager
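If exact word matches are enough, the classification from the question can also be sketched without fuzzy matching, by extending the in_dic() idea to every word of a position title. This is an alternative to the jellyfish approach, not the method used above; the names alias_to_position and classify are made up for this example, and matching is done case-insensitively as an assumption about what the spreadsheet needs.

```python
dicpositions = {
    '0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'],
    '1 - Director of': ['Director', 'Head'],
    '2 - Manager': ['Manager', 'Administrador'],
    '3 - Engenheiro': ['Engenheiro', 'Engineering'],
    '4 - Consultor': ['Consultor', 'Consultant'],
    '5 - Estagiário': ['Estagiário', 'Intern'],
    '6 - Desempregado': ['Self-Employed', 'Autônomo'],
    '7 - Professor': ['Professor', 'Researcher'],
}

# Flat alias -> category mapping, keyed case-insensitively
alias_to_position = {
    alias.casefold(): position
    for position, aliases in dicpositions.items()
    for alias in aliases
}


def classify(title):
    """Return the category of the first word in title that is a known alias, else None."""
    for word in title.split():
        position = alias_to_position.get(word.casefold())
        if position is not None:
            return position
    return None


titles = ['(blank)', 'Estagiário', 'Professor Adjunto', 'CEO and Founder',
          'Engenheiro de produção', 'Consultant', 'Founder and CTO', 'Intern',
          'Manager Specialist', 'Administrador de Novos Negócios']
for title in titles:
    print(repr(title), '->', classify(title))
```

The first matching word wins, so a title containing aliases from two categories gets the category of whichever alias comes first in the title. On a DataFrame this could presumably be applied with something like plan['Position'].map(classify), provided missing values are handled first, since classify expects a string.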