DBSCAN 每股收益和 min_samples

DBSCAN eps and min_samples

我一直在尝试使用 DBSCAN 来检测异常值,根据我的理解,DBSCAN 输出 -1 作为异常值,1 作为内联值,但是在我 运行 代码之后,我得到的数字是不是 -1 或 1,有人可以解释为什么吗?使用反复试验找到 eps 的最佳值也是正常的,因为我想不出找到最佳 eps 值的方法。

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

from sklearn.cluster import DBSCAN



df = pd.read_csv('Final After Simple Filtering.csv',index_col=None,low_memory=True)


# Dropping columns with low feature importance
del df['AmbTemp_DegC']
del df['NacelleOrientation_Deg']
del df['MeasuredYawError']



#applying DBSCAN


DBSCAN = DBSCAN(eps = 1.8, min_samples =10,n_jobs=-1)

df['anomaly'] = DBSCAN.fit_predict(df)


np.unique(df['anomaly'],return_counts=True)

(array([  -1,    0,    1, ..., 8462, 8463, 8464]),
array([1737565, 3539278, 4455734, ...,      13,       8,       8]))

谢谢。

好吧,您实际上并没有真正了解 DBSCAN。

这是来自维基百科的副本:

A point p is a core point if at least minPts points are within distance ε of it (including p).

A point q is directly reachable from p if point q is within distance ε from core point p. Points are only said to be directly reachable from core points.

A point q is reachable from p if there is a path p1, ..., pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi. Note that this implies that all points on the path must be core points, with the possible exception of q.

All points not reachable from any other point are outliers or noise points.

所以用更简单的话来说,这个想法是:

  • 任何有 min_samples 个邻居的样本都是核心样本。

  • 任何不是核心的数据样本,但至少有一个核心邻居(距离小于eps),是一个直接可达的样本,可以添加到集群中。

  • 任何既不是直接可达也不是核心但至少有一个直接可达邻居(距离小于eps)的数据样本是可达样本,将被添加到集群中。

  • 任何其他示例都被认为是噪音、离群值或您想要命名的任何名称。(这些将被标记为 -1)

根据聚类的参数(eps 和 min_samples),您很可能有两个以上的聚类。你看,这就是你在聚类结果中看到 0 和 -1 以外的其他值的原因。

回答你的第二个问题

Also is it normal to find the best value of eps using trial and error,

如果你的意思是进行交叉验证(在你知道聚类标签或你可以近似正确聚类的集合上),是的,我认为这是正常的方法

PS: paper 非常好,很全面。我强烈建议你看看。祝你好运。

sklearn.cluster.DBSCAN给出噪声-1,这是一个outlier,除-1以外的所有其他值都是簇号或簇组。要查看集群总数,您可以使用命令 DBSCAN.labels_

DBScan 中使用的 eps 或 Epsilon 值是什么?

Epsilon is the local radius for expanding clusters. Think of it as a step size - DBSCAN never takes a step larger than this, but by doing multiple steps DBSCAN clusters can become much larger than eps.

如何找到最佳的 eps 值?

use any hyperparameter tuning method / package like GridSearchCV or Hyperopt. You can use any of the following indices mentioned here.

我发现这是了解 DBSCAN 工作原理的一个很好的例子。

import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)

X = StandardScaler().fit_transform(X)

# #############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

# #############################################################################
# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

a = np.array(labels)
a

结果:

array([ 0,  1,  0,  2,  0,  1,  1,  2,  0,  0,  1,  1,  1,  2,  1,  0, -1,
        1,  1,  2,  2,  2,  2,  2,  1,  1,  2,  0,  0,  2,  0,  1,  1,  0,
        1,  0,  2,  0,  0,  2,  2,  1,  1,  1,  1,  1,  0,  2,  0,  1,  2,
        2,  1,  1,  2,  2,  1,  0,  2,  1,  2,  2,  2,  2,  2,  0,  2,  2,
        0,  0,  0,  2,  0,  0,  2,  1, -1,  1,  0,  2,  1,  1,  0,  0,  0,
        0,  1,  2,  1,  2,  2,  0,  1,  0,  1, -1,  1,  1,  0,  0,  2,  1,
        2,  0,  2,  2,  2,  2, -1,  0, -1,  1,  1,  1,  1,  0,  0,  1,  0,
        1,  2,  1,  0,  0,  1,  2,  1,  0,  0,  2,  0,  2,  2,  2,  0, -1,
        2,  2,  0,  1,  0,  2,  0,  0,  2,  2, -1,  2,  1, -1,  2,  1,  1,
        2,  2,  2,  0,  1,  0,  1,  0,  1,  0,  2,  2, -1,  1,  2,  2,  1,
        0,  1,  2,  2,  2,  1,  1,  2,  2,  0,  1,  2,  0,  0,  2,  0,  0,
        1,  0,  1,  0,  1,  1,  2,  2,  0,  0,  1,  1,  2,  1,  2,  2,  2,
        2,  0,  2,  0,  2,  2,  0,  2,  2,  2,  0,  0,  1,  1,  1,  2,  2,
        2,  2,  1,  2,  2,  0,  0,  2,  0,  0,  0,  1,  0,  1,  1,  1,  2,
        1,  1,  0,  1,  2,  2,  1,  2,  2,  1,  0,  0,  1,  1,  1,  0,  1,
        0,  2,  0,  2,  2,  2,  2,  2,  1,  1,  0,  0,  1,  1,  0,  0,  2,
        1, -1,  2,  1,  1,  2,  1,  2,  0,  2,  2,  0,  1,  2,  2,  0,  2,
        2,  0,  0,  2,  0,  2,  0,  2,  1,  0,  0,  0,  1,  2,  1,  2,  2,
        0,  2,  2,  0,  0,  2,  1,  1,  1,  1,  1,  0,  1,  1,  1,  1,  0,
        0,  1,  1,  1,  0,  2,  0,  1,  2,  2,  0,  0,  2,  0,  2,  1,  0,
        2,  0,  2,  0,  2,  2,  0,  1,  0,  1,  0,  2,  2,  1,  1,  1,  2,
        0,  2,  0,  2,  1,  2,  2,  0,  1,  0,  1,  0,  0,  0,  0,  2,  0,
        2,  0,  1,  0,  1,  2,  1,  1,  1,  0,  1,  1,  0,  2,  1,  0,  2,
        2,  1,  1,  2,  2,  2,  1,  2,  1,  2,  0,  2,  1,  2,  1,  0,  1,
        0,  1,  1,  0,  1,  2, -1,  1,  0,  0,  2,  1,  2,  2,  2,  2,  1,
        0,  0,  0,  0,  1,  0,  2,  1,  0,  1,  2,  0,  0,  1,  0,  1,  1,
        0, -1,  0,  2,  2,  2,  1,  1,  2,  0,  1,  0,  0,  1,  0,  1,  1,
        2,  2, -1,  0,  1,  2,  2,  1,  1,  1,  1,  0,  0,  0,  2,  2,  1,
        2,  1,  0,  0,  1,  2,  1,  0,  0,  2,  0,  1,  0,  2,  1,  0,  2,
        2,  1,  0,  0,  0,  2,  1,  1,  0,  2,  0,  0,  1,  1,  1,  1,  0,
        1,  0,  1,  0,  0,  2,  0,  1,  1,  2,  1,  1,  0,  1,  0,  2,  1,
        0,  0,  1,  0,  1,  1,  2,  2,  1,  2,  2,  1,  2,  1,  1,  1,  1,
        2,  0,  0,  0,  1,  2,  2,  0,  2,  0,  2,  1,  0,  1,  1,  0,  0,
        1,  2,  1,  2,  2,  0,  2,  1,  1,  1,  2,  0,  0,  2,  0,  2,  2,
        0,  2,  0,  1,  1,  1,  1,  0,  0,  0,  2,  1,  1,  1,  1,  2,  2,
        2,  0,  2,  1,  1,  0,  0,  1,  0,  2,  1,  2,  1,  0,  2,  2,  0,
        0,  1,  0,  0,  2,  0,  0,  0,  2,  0,  2,  0,  0,  1,  1,  0,  0,
        1,  2,  2,  0,  0,  0,  0,  2, -1,  1,  1,  2,  1,  0,  0,  2,  2,
        0,  1,  2,  0,  1,  2,  2,  1,  0,  0, -1, -1,  2,  0,  0,  0,  2,
       -1,  2,  0,  1,  1,  1,  1,  1,  0,  0,  2,  1,  2,  0,  1,  1,  1,
        0,  2,  1,  1, -1,  2,  1,  2,  0,  2,  2,  1,  0,  0,  0,  1,  1,
        2,  0,  0,  2,  2,  1,  2,  2,  2,  0,  2,  1,  2,  1,  1,  1,  2,
        0,  2,  0,  2,  2,  0,  0,  2,  1,  2,  0,  2,  0,  0,  0,  1,  0,
        2,  1,  2,  0,  1,  0,  0,  2,  0,  2,  1,  1,  2,  1,  0,  1,  2,
        1,  2], dtype=int64)

那些 -1 数据点是异常值。让我们计算异常值的数量,看看它是否与我们在上图中看到的相符。

list(a)
b = a.tolist()
count = b.count(-1)
count

结果:

18

我们得到了 18 个!完美的!!

相关Link:

https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py