How to improve a distance function in Python

Question

I'm working on an exercise that classifies email documents (strings containing words).

I defined the distance function as follows:

def distance(wordset1, wordset2):
    if len(wordset1) < len(wordset2):
        return len(wordset2) - len(wordset1)
    elif len(wordset1) > len(wordset2):
        return len(wordset1) - len(wordset2)
    elif len(wordset1) == len(wordset2):
        return 0

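However, the resulting accuracy is quite low (0.8). I suspect this is because the distance function is not very accurate. How can I improve it? Or are there other ways to compute the "distance" between email documents?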

Answer 1

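A common similarity measure for this situation is Jaccard similarity. It ranges from 0 to 1, where 0 means the two documents share no words and 1 means they contain exactly the same set of words. It is defined as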

wordSet1 = set(wordSet1)
wordSet2 = set(wordSet2)
sim = len(wordSet1.intersection(wordSet2))/len(wordSet1.union(wordSet2))

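In essence, it is the ratio of the size of the intersection of the word sets to the size of their union. This helps control for emails of different sizes while still giving a good measure of similarity.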
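
To use this as a distance (for example in a nearest-neighbour classifier), one option is to take 1 minus the similarity. The sketch below assumes each email is a plain string tokenized on whitespace; the name jaccard_distance and the convention of returning 0.0 for two empty emails are assumptions made for illustration.

```python
def jaccard_distance(email1, email2):
    """Return 1 - Jaccard similarity of the word sets of two strings."""
    words1 = set(email1.lower().split())
    words2 = set(email2.lower().split())
    union = words1 | words2
    if not union:
        # Assumption: two empty emails are treated as identical.
        return 0.0
    return 1.0 - len(words1 & words2) / len(union)


print(jaccard_distance("meeting at noon", "meeting at noon tomorrow"))  # 0.25
print(jaccard_distance("buy cheap pills", "meeting at noon"))           # 1.0
```

The result stays between 0 and 1 regardless of email length, which makes distances between different pairs comparable.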

Answer 2

You didn't mention the types of wordset1 and wordset2. I'll assume they are both strings.

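You defined your distance as the difference in word count and got a poor score. Text length is clearly not a good dissimilarity measure: two emails of different sizes can talk about the same thing, while two emails of the same size can talk about completely different things.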
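
A quick check makes the point concrete. The snippet below applies a length-based distance equivalent to the one in the question to three made-up emails (the strings are invented for illustration), each represented as a list of words.

```python
def length_distance(wordset1, wordset2):
    """The question's distance: absolute difference in word count."""
    return abs(len(wordset1) - len(wordset2))


spam = "win a free prize now".split()
work = "please review the attached report".split()
long_spam = "win a free prize now by clicking this link today".split()

# Same length, unrelated content: distance 0.
print(length_distance(spam, work))       # 0
# Same topic, different length: distance 5.
print(length_distance(spam, long_spam))  # 5
```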

So, as suggested above, you could try checking for shared words instead:

import numpy as np

def distance(wordset1, wordset2):
    wordset1 = set(wordset1.split())
    wordset2 = set(wordset2.split())

    common_words = wordset1 & wordset2
    if common_words:
        return 1 / len(common_words) 
    else:
        # They don't share any word. They are infinitely different.
        return np.inf

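The problem is that two large emails are more likely to share words than two small ones, and this metric would favor them, making them "more similar to each other" than the small ones. How do we solve this? Well, we normalize the metric somehow: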

import numpy as np

def distance(wordset1, wordset2):
    wordset1 = set(wordset1.split())
    wordset2 = set(wordset2.split())

    common_words = wordset1 & wordset2
    if common_words:
        # The number of distinct words across both emails divided
        # by the number of shared words (the inverse of Jaccard
        # similarity), so that large emails are not favored.
        return len(wordset1 | wordset2) / len(common_words)
    else:
        # They don't share any word. They are infinitely different.
        return np.inf

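This looks cool, but it completely ignores the frequency of the words. To account for that, we can use the bag-of-words model. That is, create a list of all possible words and histogram their appearance in each document. Let's use the CountVectorizer implementation from scikit-learn to make our job easier: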

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def distance(wordset1, wordset2):
    model = CountVectorizer()
    X = model.fit_transform([wordset1, wordset2]).toarray()

    # uses Euclidean distance between bags.
    return np.linalg.norm(X[0] - X[1])
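
To see what the bag-of-words representation contains, here is a dependency-free sketch that builds the same kind of count vectors with collections.Counter. It assumes plain whitespace tokenization, which is simpler than CountVectorizer's default tokenizer, so the numbers may differ from scikit-learn's on real text; the function name is illustrative.

```python
import math
from collections import Counter


def bow_distance(email1, email2):
    """Euclidean distance between the word-count vectors of two strings."""
    counts1 = Counter(email1.lower().split())
    counts2 = Counter(email2.lower().split())
    vocabulary = set(counts1) | set(counts2)
    # Counter returns 0 for words missing from a document.
    return math.sqrt(sum((counts1[w] - counts2[w]) ** 2 for w in vocabulary))


print(bow_distance("free free prize", "free prize"))  # 1.0
print(bow_distance("free prize", "free prize"))       # 0.0
```

Unlike the set-based versions, repeating a word now changes the distance.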

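But now consider two pairs of emails. The emails in the first pair are written in perfect English, full of "small" words (e.g. a, an, is, and, that) that are necessary for grammatical correctness. The emails in the second pair are different: they contain only keywords and are extremely dry. You see, chances are the first pair will come out as more similar than the second. This happens because we currently treat all words alike, whereas we should prioritize the words that are meaningful in each text. To do that, let's use term frequency-inverse document frequency (TF-IDF). Luckily, scikit-learn has an implementation with a very similar interface: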

from sklearn.feature_extraction.text import TfidfVectorizer

def distance(wordset1, wordset2):
    model = TfidfVectorizer()
    X = model.fit_transform([wordset1, wordset2]).toarray()

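    # TfidfVectorizer L2-normalizes each row by default, so the
    # dot product of two rows is their cosine similarity.
    similarity_matrix = X.dot(X.T)
    # The dissimilarity between samples wordset1 and wordset2.
    return 1 - similarity_matrix[0, 1]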
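
One thing to keep in mind: the function above fits the vectorizer on only the two emails being compared, so the inverse document frequencies reflect just those two documents. A possible variation is to compute the weights once over the whole collection and then compare rows. The dependency-free sketch below illustrates the idea; the whitespace tokenization, the smoothed IDF formula, and the function names are assumptions of the sketch, not scikit-learn's implementation.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {word: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch): rare words weigh more.
        weights = {w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
                   for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        vectors.append({w: v / norm for w, v in weights.items()} if norm else {})
    return vectors


def cosine_distance(vec1, vec2):
    """1 - cosine similarity of two L2-normalized sparse vectors."""
    return 1.0 - sum(v * vec2.get(w, 0.0) for w, v in vec1.items())


corpus = [
    "the meeting is at noon",
    "the meeting is moved to friday",
    "win the free prize now",
]
vectors = tfidf_vectors(corpus)
# The two meeting emails end up closer to each other than to the spam one.
print(cosine_distance(vectors[0], vectors[1]) < cosine_distance(vectors[0], vectors[2]))
```

Here the word "the" appears in every email and so contributes little, while "meeting" is rarer and pulls the first two emails together.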

Read more about this in this question. Also, is this a duplicate?

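You should now have fairly good accuracy. Try it out. If it's still not as good as you want, then we have to go deeper... (get it? Because... deep learning). First, we need either a dataset to train on or an already trained model. This is required because networks have many parameters that must be tuned before they provide useful transformations.

What has been missing so far is understanding. We histogrammed the words, stripping them of any context or meaning. Instead, let's keep them where they are and try to recognize blocks of patterns. How can this be done?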

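1. Embed the words into numbers, which deals with words of different sizes.
2. Pad every sequence of numbers (word embeddings) to a single length.
3. Use convolutional networks to extract meaningful features from the sequences.
4. Use fully connected networks to project the extracted features into a space that minimizes the distance between similar emails and maximizes the distance between non-similar ones.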
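
The first two steps can be sketched without a deep learning library. The snippet below assumes whitespace tokenization, reserves index 0 for padding and 1 for unknown words, and pads on the left; all of these are illustrative choices. An embedding layer would then map each integer to a dense vector.

```python
def build_vocabulary(emails):
    """Map each distinct word to an integer index, starting at 2."""
    vocabulary = {}
    for email in emails:
        for word in email.lower().split():
            vocabulary.setdefault(word, len(vocabulary) + 2)
    return vocabulary


def encode(email, vocabulary, maxlen):
    """Turn an email into a fixed-length list of word indices."""
    # Unknown words map to 1 (an assumption of this sketch).
    indices = [vocabulary.get(word, 1) for word in email.lower().split()]
    indices = indices[-maxlen:]  # keep the last maxlen words if too long
    return [0] * (maxlen - len(indices)) + indices  # left-pad with zeros


emails = ["meeting at noon", "free prize"]
vocabulary = build_vocabulary(emails)
print(encode("free meeting today", vocabulary, maxlen=5))  # [0, 0, 5, 2, 1]
```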

Let's use Keras to simplify our lives. It should look something like this:

# ... imports and params definitions

# Layer and argument names follow the Keras 2+ API.
model = Sequential([
    Embedding(max_features, embedding_dims),
    Dropout(0.2),
    Conv1D(filters=nb_filter,
           kernel_size=filter_length,
           padding='valid',
           activation='relu',
           strides=1),
    # Max over the whole sequence: one value per filter, already flat.
    GlobalMaxPooling1D(),
    Dense(256, activation='relu'),
])

# ... train or load model weights.

def distance(wordset1, wordset2):
    global model
    # X = ... # Embed both emails.
    X = sequence.pad_sequences(X, maxlen=maxlen)
    y = model.predict(X)
    # Euclidean distance between emails.
    return np.linalg.norm(y[0]-y[1])

There is a working example of sentence processing that you can look at in the Keras GitHub repo. Also, someone solves this exact same problem using a siamese recurrent network in this StackOverflow question.

Well, I hope this gives you some direction. :-)

python

distance

knn