How can I partially match numbers to strings?
Inspired by a previous question, but now I want to introduce the complication of partial matching.
Data:
PreviousData = { 'Item' : ['abc-023','def-78','ghi-012','jkl-100','mno-01','pqr-890','stu-024','vwx-765','yza-789','uaza-400','fupa-499'],
'Summary' : ['party','weekend','food','school','tv','photo','camera','python','r','rstudio','spyder'],
'2022-01-01' : [1, np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, np.nan, np.nan,np.nan,2],
'2022-02-01' : [1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-03-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan],
'2022-04-01' : [np.nan,np.nan,3,np.nan,np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-05-01' : [np.nan,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,3,np.nan],
'2022-06-01' : [np.nan,np.nan,np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-07-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan],
'2022-08-01' : [np.nan,np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-09-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,1,np.nan],
'2022-10-01' : [np.nan,np.nan,1,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-11-01' : [np.nan,2,np.nan,np.nan,1,1,1,np.nan,np.nan,np.nan,np.nan],
'2022-12-01' : [np.nan,np.nan,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,np.nan],
'2023-01-01' : [np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,2,np.nan,np.nan],
'2023-02-01' : [np.nan,np.nan,np.nan,2,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-03-01' : [np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-04-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
'2023-05-01' : [np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,2,np.nan],
'2023-06-01' : [1,1,np.nan,np.nan,9,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-07-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-08-01' : [np.nan,1,np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,np.nan],
'2023-09-01' : [np.nan,1,1,np.nan,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
}
PreviousData = pd.DataFrame(PreviousData)
PreviousData
CurrentData = { 'Item' : ['ghi-012:XYZ','stu-024:Z','abc-023-100','mno-01-100:Z','jkl-100:Z-900','pqr-890-FR','def-78-RF-FR','vwx-765:NCVE','yza-789-YU'],
'Summary' : ['food','camera','party','tv','school','photo','weekend','python','r'],
'2022-01-01' : [3, np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, np.nan, np.nan],
'2022-02-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-03-01' : [np.nan,1,1,1,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-04-01' : [np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-05-01' : [np.nan,np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-06-01' : [2,np.nan,np.nan,np.nan,4,np.nan,np.nan,np.nan,np.nan],
'2022-07-01' : [np.nan,np.nan,np.nan,np.nan,np.nan,4,np.nan,np.nan,np.nan],
'2022-08-01' : [np.nan,np.nan,3,np.nan,4,np.nan,np.nan,np.nan,np.nan],
'2022-09-01' : [np.nan,np.nan,3,3,3,np.nan,np.nan,5,5],
'2022-10-01' : [np.nan,np.nan,np.nan,np.nan,5,np.nan,np.nan,np.nan,np.nan],
'2022-11-01' : [np.nan,np.nan,np.nan,5,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-12-01' : [np.nan,4,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
'2023-01-01' : [np.nan,np.nan,np.nan,np.nan,1,1,np.nan,np.nan,np.nan],
'2023-02-01' : [np.nan,np.nan,np.nan,2,1,np.nan,np.nan,np.nan,np.nan],
'2023-03-01' : [np.nan,np.nan,np.nan,np.nan,2,np.nan,2,np.nan,2],
'2023-04-01' : [np.nan,np.nan,np.nan,np.nan,np.nan,2,np.nan,np.nan,2],
}
CurrentData = pd.DataFrame(CurrentData)
CurrentData
Examples of partial matches are: abc-023 vs abc-023-100; stu-024 vs stu-024:Z, etc.
Code attempted:
PreviousData_t = PreviousData.melt(id_vars=["Item", "Summary"],
                                   var_name="Date",
                                   value_name="value1")
CurrentData_t = CurrentData.melt(id_vars=["Item", "Summary"],
                                 var_name="Date",
                                 value_name="value2")
Compare = PreviousData_t.merge(CurrentData_t, on =['Date','Item','Summary'], how = 'left')
Compare['diff'] = np.where(Compare['value1']!=Compare['value2'], 1,0)
# Code does not take into account partial matches of Items
Any hints on this would be much appreciated.
This is a kind of clustering problem, and I'll present one solution.
After writing this post, I remembered that this is exactly the problem Google Refine solves. You can read about the open-source version, OpenRefine, here: https://guides.library.illinois.edu/openrefine/clustering
Anyway, first I combine all the strings from the Item columns and store them in a list all_items.
import pandas as pd
import numpy as np
prev = list(PreviousData.Item)
curr = list(CurrentData.Item)
all_items = prev+curr
all_items
['abc-023',
'def-78',
'ghi-012',
'jkl-100',
'mno-01',
'pqr-890',
'stu-024',
'vwx-765',
'yza-789',
'uaza-400',
'fupa-499',
'ghi-012:XYZ',
'stu-024:Z',
'abc-023-100',
'mno-01-100:Z',
'jkl-100:Z-900',
'pqr-890-FR',
'def-78-RF-FR',
'vwx-765:NCVE',
'yza-789-YU']
So now you want to group similar strings together, e.g. 'abc-023' and 'abc-023-100', or 'pqr-890' and 'pqr-890-FR'. In all_items there are at most two similar strings per group, but in general this is a harder problem, because one string can be similar to several others: how do you decide which one is the best match? The solution to this problem is called clustering.
Regarding the similarity function: the example seems to suggest that you want two strings to match if one is a substring of the other. In general, there are many similarity functions, and you can pick the one that best suits your application.
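For intuition, here is a quick self-contained check (using a few of the item strings above) of how difflib's SequenceMatcher, which the solution below relies on, scores such pairs:

```python
from difflib import SequenceMatcher

# A few pairs from the data above: two genuine partial matches and one non-match
pairs = [('abc-023', 'abc-023-100'),
         ('pqr-890', 'pqr-890-FR'),
         ('uaza-400', 'fupa-499')]

for a, b in pairs:
    # ratio() is in [0, 1]; 1 means the strings are identical
    print(a, b, round(SequenceMatcher(None, a, b).ratio(), 3))
```

The two real partial matches score well above 0.6, while the unrelated pair scores lower; that is what makes the 0.4 distance threshold used below a reasonable cutoff.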
I'll show a solution that uses DBSCAN clustering from sklearn together with SequenceMatcher from difflib as the similarity measure. This may be overkill in this case, but it can be useful for larger datasets and more complex string-matching tasks.
la = len(all_items)
from difflib import SequenceMatcher
from sklearn.cluster import DBSCAN
# this is the distance function between strings
diff = lambda i,j: 1 - SequenceMatcher(None, all_items[i], all_items[j]).ratio()
# Note: SequenceMatcher ratio goes from 0 to 1, highest similarity is 1
# but since we’re building a distance matrix, highest similarity=minimal distance
diff_matrix = np.zeros((la, la))
for i in range(la):
    for j in range(i, la):
        diff_matrix[i, j] = diff(i, j)
        diff_matrix[j, i] = diff_matrix[i, j]
pd.DataFrame(diff_matrix) # for pretty-printing (note: this is a symmetric matrix)
# all distances over 0.4 are too far (this means two strings match if SequenceMatcher ratio is >0.6)
db = DBSCAN(eps=0.4, min_samples=2, metric='precomputed').fit(diff_matrix)
Now we have clustered the strings. How many clusters are there?
# number of clusters is the number of unique labels except for noise
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
n_clusters_
# 9
Since we started with 20 items and got 9 clusters, it looks like most strings matched in pairs.
These are the string clusters:
clusters = {'label'+str(k):[] for k in set(labels)}
for k, v in zip(labels, all_items):
    clusters['label'+str(k)].append(v)
clusters
# {'label0': ['abc-023', 'abc-023-100'],
# 'label1': ['def-78', 'def-78-RF-FR'],
# 'label2': ['ghi-012', 'ghi-012:XYZ'],
# 'label3': ['jkl-100', 'jkl-100:Z-900'],
# 'label4': ['mno-01', 'mno-01-100:Z'],
# 'label5': ['pqr-890', 'pqr-890-FR'],
# 'label6': ['stu-024', 'stu-024:Z'],
# 'label7': ['vwx-765', 'vwx-765:NCVE'],
# 'label8': ['yza-789', 'yza-789-YU'],
# 'label-1': ['uaza-400', 'fupa-499']}
Nine pairs of strings matched, and two strings matched nothing (label = -1).
Create a dictionary normalized_strings that maps every string to a unique representative of its cluster. I pick the first value in each group (e.g. in the group ['abc-023', 'abc-023-100'] I pick 'abc-023').
normalized_strings = {all_items[k]: clusters['label'+str(labels[k])][0] if labels[k]>-1 else all_items[k] for k in range(len(all_items))}
normalized_strings
# {'abc-023': 'abc-023',
# 'def-78': 'def-78',
# 'ghi-012': 'ghi-012',
# 'jkl-100': 'jkl-100',
# 'mno-01': 'mno-01',
# 'pqr-890': 'pqr-890',
# 'stu-024': 'stu-024',
# 'vwx-765': 'vwx-765',
# 'yza-789': 'yza-789',
# 'uaza-400': 'uaza-400',
# 'fupa-499': 'fupa-499',
# 'ghi-012:XYZ': 'ghi-012',
# 'stu-024:Z': 'stu-024',
# 'abc-023-100': 'abc-023',
# 'mno-01-100:Z': 'mno-01',
# 'jkl-100:Z-900': 'jkl-100',
# 'pqr-890-FR': 'pqr-890',
# 'def-78-RF-FR': 'def-78',
# 'vwx-765:NCVE': 'vwx-765',
# 'yza-789-YU': 'yza-789'}
With this dictionary, you can now "translate" all the strings in the dataframes.
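As a sketch of that last step (using a hypothetical two-row miniature of PreviousData/CurrentData and a hand-written slice of normalized_strings, so the example is self-contained): map Item through the dictionary, then rerun the melt/merge comparison from the question on the normalized key.

```python
import pandas as pd
import numpy as np

# Miniature stand-ins for PreviousData / CurrentData (illustrative rows only)
prev = pd.DataFrame({'Item': ['abc-023', 'stu-024'],
                     'Summary': ['party', 'camera'],
                     '2022-01-01': [1.0, np.nan]})
curr = pd.DataFrame({'Item': ['abc-023-100', 'stu-024:Z'],
                     'Summary': ['party', 'camera'],
                     '2022-01-01': [3.0, np.nan]})

# Cluster representatives, as produced by the DBSCAN step above
normalized_strings = {'abc-023': 'abc-023', 'abc-023-100': 'abc-023',
                      'stu-024': 'stu-024', 'stu-024:Z': 'stu-024'}

# Translate every Item to its representative, then compare as in the question
prev['Item'] = prev['Item'].map(normalized_strings)
curr['Item'] = curr['Item'].map(normalized_strings)

prev_t = prev.melt(id_vars=['Item', 'Summary'], var_name='Date', value_name='value1')
curr_t = curr.melt(id_vars=['Item', 'Summary'], var_name='Date', value_name='value2')

Compare = prev_t.merge(curr_t, on=['Date', 'Item', 'Summary'], how='left')
# Note: as in the question's code, NaN != NaN evaluates True, so two NaNs count as a diff
Compare['diff'] = np.where(Compare['value1'] != Compare['value2'], 1, 0)
print(Compare)
```

Thanks to the mapping, 'abc-023' and 'abc-023-100' now land on the same merge key, which is exactly what the original merge could not do.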
About the diff function:
diff = lambda i,j: 1 - SequenceMatcher(None, all_items[i], all_items[j]).ratio()
diff(0, 13)
# 0.2222222222222222
'abc-023' and 'abc-023-100' are very similar, so their distance is small.
If SequenceMatcher is too expensive, you can also define a simpler string-matching function, e.g. two strings match if one is a substring of the other:
diff = lambda i,j: 1 - int((all_items[i] in all_items[j]) or (all_items[j] in all_items[i]))
diff(0,13) # strings match, no difference
# 0
diff(0,12) # strings do not match
# 1
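With that 0/1 metric you don't strictly need DBSCAN at all. A minimal self-contained sketch (illustrative strings only) that assigns each string to the first already-seen representative it substring-matches:

```python
items = ['abc-023', 'uaza-400', 'abc-023-100', 'stu-024', 'stu-024:Z']

# Two strings match if one is a substring of the other
substr_match = lambda a, b: a in b or b in a

groups = {}  # representative -> members
for s in items:
    for rep in groups:
        if substr_match(s, rep):
            groups[rep].append(s)
            break
    else:
        # no existing representative matched; s starts a new group
        groups[s] = [s]

print(groups)
```

This single pass recovers the same pairs on this data; DBSCAN earns its keep when the similarity is graded (like the SequenceMatcher ratio) rather than binary.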
See also:
- difflib.SequenceMatcher: a flexible class for comparing pairs of sequences of any type, as long as the sequence elements are hashable
- sklearn DBSCAN
- DBSCAN demo