Creating a new column that contains similar words
I have a column in a pandas DataFrame that contains similar words:
My_Column
thereisacat
there-is_cat
mummy
mommy
mammy
Daniel-1
Daniel
Bob
I would like to create a column in which each row contains the most similar words, for example:
My_Column Similar_to
thereisacat [there-is_cat]
there-is_cat [thereisacat]
mummy [mommy, mammy]
mommy [mummy, mammy]
mammy [mummy, mommy]
Daniel-1 [Daniel]
Daniel [Daniel-1]
Bob []
To compute the similarity, I am considering the following options:
(1)
import nltk
nltk.edit_distance()
(2)
import difflib
seq = difflib.SequenceMatcher()
(3)
import jellyfish
jellyfish.levenshtein_distance() # or jellyfish.jaro_distance()
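As a rough illustration of how one of these options is called on a pair of strings, here is a minimal sketch using only the standard-library `difflib`. The helper name `ratio` and the sample pairs are assumptions chosen for illustration. Note that `SequenceMatcher.ratio()` is a similarity (higher means closer), whereas an edit distance is the opposite (lower means closer), so any threshold has to be flipped when switching between them.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical sample pairs taken from the column in the question.
pairs = [("mummy", "mommy"), ("Daniel-1", "Daniel"), ("mummy", "Bob")]
scores = {pair: ratio(*pair) for pair in pairs}

for pair, score in scores.items():
    print(pair, round(score, 3))
```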
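I would like to know how to apply one of these three algorithms to create a column that lists the words in My_Column that are most similar to each row.
If I am right (though perhaps not about my code), what I should do is build a matrix that has every row of My_Column as a column, assign the similarity values, and then extract the most similar words.
Something like this: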
def sim_metric(col1, col2): # actually should be more
    return SequenceMatcher(None, df[col1], df[col2]).ratio()
df['Similar_to'] = df.apply(sim_metric, axis=1)
This should help:
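from itertools import product
import nltk

# Assumption: any distance function can be plugged in here; nltk.edit_distance
# is one of the options listed in the question.
distance_algo = nltk.edit_distance

unique_words = set(df['My_Column'])
# Distance for every ordered pair of words.
similarity_matrix = {(a, b): distance_algo(a, b) for a, b in product(unique_words, repeat=2)}
similar_to_map = {}
for word in unique_words:
    wlist = [(sword, similarity_matrix[(word, sword)]) for sword in unique_words if sword != word]
    wlist.sort(key=lambda x: x[1])  # sorts in place, smallest distance first
    similar_to_map[word] = wlist[0][0] if wlist else None  # single closest word
df['Similar_to'] = df['My_Column'].map(similar_to_map)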
First, create a similarity matrix, then filter the matrix with a threshold and append the results.
import pandas as pd
import nltk

similarity = {}
similar_words = []
thresh = 3  # keep words whose edit distance is below this value
df = pd.read_clipboard()
# Build the pairwise edit-distance matrix as a dict of dicts.
for word in df.My_Column:
    scores = {}
    for match in df.My_Column:
        score = nltk.edit_distance(word, match)
        scores[match] = score
    similarity[word] = scores
matrix_df = pd.DataFrame(similarity)
# For each word, keep the other words closer than the threshold
# (0 < val excludes the word itself).
for value in matrix_df.to_dict('dict').values():
    similar_words.append([key for key, val in value.items() if 0 < val < thresh])
df['Similar_to'] = similar_words
df
Out[1]:
My_Column Similar_to
0 thereisacat [there-is_cat]
1 there-is_cat [thereisacat]
2 mummy [mommy, mammy]
3 mommy [mummy, mammy]
4 mammy [mummy, mommy]
5 Daniel-1 [Daniel]
6 Daniel [Daniel-1]
7 Bob []
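If installing nltk is not an option, a similar result can probably be obtained with the standard-library `difflib.get_close_matches`. The sketch below is an alternative technique, not the approach of the answer above: it thresholds on `SequenceMatcher` ratio instead of edit distance. The function name `similar_words` and the values `cutoff=0.75` and `n=5` are assumptions picked to suit this sample data and would need tuning for other data.

```python
from difflib import get_close_matches


def similar_words(words, cutoff=0.75, n=5):
    """For each word, list up to n other words whose ratio is >= cutoff."""
    result = []
    for word in words:
        # Exclude the word itself so it is not reported as its own match.
        candidates = [w for w in words if w != word]
        result.append(get_close_matches(word, candidates, n=n, cutoff=cutoff))
    return result


# Sample data copied from the question.
words = ["thereisacat", "there-is_cat", "mummy", "mommy",
         "mammy", "Daniel-1", "Daniel", "Bob"]
similar = similar_words(words)

for word, matches in zip(words, similar):
    print(word, matches)
```

With a DataFrame, the result could be assigned with something like df['Similar_to'] = similar_words(df['My_Column'].tolist()). Like the loop above, this compares every word with every other word, so the cost grows quadratically with the number of rows.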