Creating a new column which contains similar words

Question

I have a column in a pandas DataFrame that contains similar words:

My_Column
thereisacat
there-is_cat
mummy
mommy
mammy
Daniel-1
Daniel 
Bob

I would like to create a column in which each row lists the most similar words, for example:

My_Column        Similar_to
thereisacat     [there-is_cat]
there-is_cat    [thereisacat]
mummy           [mommy, mammy]
mommy           [mummy, mammy]
mammy           [mummy, mommy]
Daniel-1        [Daniel]
Daniel          [Daniel-1]
Bob             []

To compute the similarity, I am considering the following:

(1)

import nltk
nltk.edit_distance()

(2)

import difflib
seq = difflib.SequenceMatcher()

(3)

import jellyfish
jellyfish.levenshtein_distance() # or jellyfish.jaro_distance()

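I would like to know how to apply one of these three algorithms to create a column that lists, for each row, the words in My_Column that are most similar to it.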

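If I am right (though my code may not be), what I should do is build a matrix that has every row of My_Column as a column too, fill it with similarity values, and then extract the most similar words. Something like this: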

def sim_metric(col1, col2): # actually should be more
    return SequenceMatcher(None, df[col1], df[col2]).ratio()

df['Similar_to'] = df.apply(sim_metric,
                          axis=1)

Answer 1

This should help:

from itertools import product

import nltk

# Assumption: any two-string distance function works here; nltk.edit_distance is one option.
distance_algo = nltk.edit_distance

unique_words = set(df['My_Column'])

# Distance for every ordered pair of words
similarity_matrix = {(a, b): distance_algo(a, b) for a, b in product(unique_words, repeat=2)}

similar_to_map = {}
for word in unique_words:
    wlist = [(sword, similarity_matrix[(word, sword)]) for sword in unique_words if sword != word]
    wlist.sort(key=lambda x: x[1])  # sorts in place, smallest distance first
    similar_to_map[word] = wlist[0][0]  # the single closest word

df['Similar_to'] = df['My_Column'].map(similar_to_map)
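
`distance_algo` can be any function that takes two strings and returns a distance. If no library is at hand, here is a minimal sketch of a Levenshtein (edit) distance in pure Python. The name `levenshtein` is my own and does not come from any library; treat it as an illustration rather than a replacement for `nltk.edit_distance` or `jellyfish.levenshtein_distance`.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or
    substitutions needed to turn a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("mummy", "mommy"))      # 1
print(levenshtein("Daniel-1", "Daniel"))  # 2
```

A smaller value means the words are closer, so sorting in ascending order puts the best candidate first.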

Answer 2

First create a similarity matrix, then filter the matrix with a threshold and append the results.

import pandas as pd
import nltk

similarity = {}
similar_words = []
thresh = 3  # keep words whose edit distance is 1 or 2

df = pd.read_clipboard()  # reads the My_Column table copied from the question

# Edit distance between every pair of words
for word in df.My_Column:
    scores = {}
    for match in df.My_Column:
        score = nltk.edit_distance(word, match)
        scores[match] = score
    similarity[word] = scores

matrix_df = pd.DataFrame(similarity)

# For each word (one column of the matrix), keep the other words below the threshold
for value in matrix_df.to_dict('dict').values():
    similar_words.append([key for key, val in value.items() if 0 < val < thresh])

df['Similar_to'] = similar_words

df

Out[1]: 
      My_Column      Similar_to
0   thereisacat  [there-is_cat]
1  there-is_cat   [thereisacat]
2         mummy  [mommy, mammy]
3         mommy  [mummy, mammy]
4         mammy  [mummy, mommy]
5      Daniel-1        [Daniel]
6        Daniel      [Daniel-1]
7           Bob              []
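
The same threshold idea can be sketched with option (2) from the question, `difflib.SequenceMatcher`, which ships with the standard library. Note that `ratio()` is a similarity between 0 and 1 (higher means closer), so the filter keeps values above a cutoff instead of below one. The function name `similar_words` and the cutoff of 0.75 are my own assumptions, chosen so that the sample data groups as in the question; other data will likely need a different cutoff.

```python
from difflib import SequenceMatcher


def similar_words(words, cutoff=0.75):
    """For each word, list the other words whose similarity ratio
    is at least cutoff (assumed value, tune for your data)."""
    result = []
    for word in words:
        matches = [
            other for other in words
            if other != word
            and SequenceMatcher(None, word, other).ratio() >= cutoff
        ]
        result.append(matches)
    return result


words = ["thereisacat", "there-is_cat", "mummy", "mommy",
         "mammy", "Daniel-1", "Daniel", "Bob"]
for word, matches in zip(words, similar_words(words)):
    print(word, matches)
```

With the DataFrame from the question, the result could be assigned as `df['Similar_to'] = similar_words(df['My_Column'].tolist())`. Both this and the edit-distance version compare every pair of words, so the cost grows quadratically with the number of rows.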

python

nltk

pandas