How to get a dataset that contains both categorical and continuous data into a user-defined metric function in DBSCAN?

Question

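I have a dataset that contains both continuous and categorical values. I want to write a function to use as the metric in DBSCAN. It should use the ordinary Euclidean distance for the continuous values, and for the categorical values it should compare the whole string value with the other string value: if the two values are equal the distance must be 0, and if they are not equal it must be 1. When I try to write a user-defined function for the metric, the data is not passed to my function at all. It throws an error like: could not convert string to float: "'second'". Is there any way to pass the data into my function?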

The data frame looks like:

        sundar call      raju   ram     sony  tintu  banti
points                                                    
x1         0.6  '0'   'first'  0.93   'lion'   0.34   0.98
x2         0.7  '1'  'second'  0.47    'cat'   0.43   0.76
x3         0.4  '0'   'third'  0.87  'tiger'   0.24   0.10
x4         0.6  '0'   'first'  0.93   'lion'   0.34   0.98
x5         0.5  '1'   'first'  0.32  'tiger'   0.09   0.99
x6         0.4  '0'   'third'  0.78  'tiger'   0.18   0.17
x7         0.5  '1'  'second'  0.98    'cat'   0.47   0.78

Answer 1

I think you should initialize DBSCAN with the "precomputed" metric:

import sklearn.cluster
dbscan = sklearn.cluster.DBSCAN(metric="precomputed")

(other parameters omitted). Then compute your metric between all pairs of samples, which gives a matrix of shape [n_samples, n_samples].

X = user_defined_metric(data, data)
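
The question does not show user_defined_metric, so the following is only a minimal sketch of what it could look like. It assumes that data is a pandas DataFrame, that columns with a numeric dtype are the continuous ones and all other columns are categorical, and that the two parts are combined as the square root of the squared Euclidean part plus the number of mismatching categorical columns. That combination rule is an assumption; the question only fixes the per-column behaviour (Euclidean for continuous, 0 or 1 for categorical).

```python
import numpy as np
import pandas as pd


def user_defined_metric(a: pd.DataFrame, b: pd.DataFrame) -> np.ndarray:
    """Pairwise mixed distance between the rows of a and the rows of b."""
    # Assumption: numeric columns are continuous, all others are categorical.
    num_cols = a.select_dtypes(include="number").columns
    cat_cols = a.columns.difference(num_cols)

    a_num = a[num_cols].to_numpy(dtype=float)
    b_num = b[num_cols].to_numpy(dtype=float)
    # Squared Euclidean part, shape (len(a), len(b)).
    sq = ((a_num[:, None, :] - b_num[None, :, :]) ** 2).sum(axis=2)

    a_cat = a[cat_cols].to_numpy(dtype=object)
    b_cat = b[cat_cols].to_numpy(dtype=object)
    # Per categorical column: 0 if the strings are equal, 1 otherwise.
    mismatch = (a_cat[:, None, :] != b_cat[None, :, :]).sum(axis=2)

    # Assumed combination rule: sqrt(squared Euclidean + mismatch count).
    return np.sqrt(sq + mismatch)
```

With this definition, two identical rows such as x1 and x4 in the question get distance 0, and the resulting matrix is symmetric, which is what DBSCAN expects from a precomputed distance matrix.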

Then fit DBSCAN with this data:

labels = dbscan.fit_predict(X)

According to the sklearn documentation,

fit_predict(X, y=None, sample_weight=None)

Performs clustering on X and returns cluster labels.
Parameters: 
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)
    A feature array, or array of distances between samples if metric='precomputed'.

The second case, an array of shape [n_samples, n_samples], is the one that applies to you.
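
If you would still rather hand DBSCAN a callable, one possible workaround (not part of the approach above) is to replace the strings by integer codes first. As far as I can tell, the "could not convert string to float" error comes from scikit-learn converting the input to a float array before the metric function is ever called, so the callable only ever sees numeric vectors. The sketch below uses a few rows and columns from the question; cat_cols, mixed_metric, and the eps and min_samples values are illustrative assumptions, and the way the two parts are combined is assumed as well.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# A few rows and columns from the question's data frame.
data = pd.DataFrame(
    {
        "sundar": [0.6, 0.7, 0.4, 0.6],
        "call": ["0", "1", "0", "0"],
        "raju": ["first", "second", "third", "first"],
        "ram": [0.93, 0.47, 0.87, 0.93],
    },
    index=["x1", "x2", "x3", "x4"],
)

cat_cols = ["call", "raju"]  # assumed categorical columns
encoded = data.copy()
for col in cat_cols:
    # Replace each distinct string by an integer code; only equality matters.
    encoded[col] = pd.factorize(encoded[col])[0]

cat_idx = [encoded.columns.get_loc(c) for c in cat_cols]
num_idx = [i for i in range(encoded.shape[1]) if i not in cat_idx]


def mixed_metric(u, v):
    # u and v arrive as 1-D float arrays, one per sample.
    sq = np.sum((u[num_idx] - v[num_idx]) ** 2)
    mismatch = np.sum(u[cat_idx] != v[cat_idx])
    # Assumed combination rule: sqrt(squared Euclidean + mismatch count).
    return np.sqrt(sq + mismatch)


# eps and min_samples are illustrative values, not tuned.
dbscan = DBSCAN(eps=0.5, min_samples=2, metric=mixed_metric, algorithm="brute")
labels = dbscan.fit_predict(encoded.to_numpy(dtype=float))
```

Here x1 and x4 are identical, so they end up in the same cluster, while x2 and x3 differ from every other row in at least one categorical column and are marked as noise with these settings. A Python callable is evaluated pair by pair, so for larger datasets the precomputed matrix is likely to be faster.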

python

cluster-analysis

data-mining

python-3.x

dbscan