Calculating Hamming distance in a given year
I have the following dataframe:
Bacteria Year Feature_Vector
XYRT23 1968 [0 1 0 0 1 1 0 0 0 0 1 1]
XXQY12 1968 [0 1 0 0 0 1 1 0 0 0 1 1]
RTy11R 1968 [1 0 0 0 0 1 1 0 1 1 1 1]
XYRT23 1969 [0 1 0 0 1 1 0 0 0 0 1 1]
XXQY12 1969 [0 0 1 0 0 1 1 0 0 0 1 1]
RTy11R 1969 [1 0 0 0 0 1 1 1 1 1 1 1]
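I want to calculate the pairwise Hamming distance for every pair within a given year and save the result to a new dataframe. Example: (note: I made up the Hamming distance numbers, and I don't actually need the Pair column)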
Pair Year HammingDistance
XYRT23 - XXQY12 1968 0.24
XYRT23 - RTy11R 1968 0.33
XXQY12 - RTy11R 1968 0.29
XYRT23 - XXQY12 1969 0.22
XYRT23 - RTy11R 1969 0.34
XXQY12 - RTy11R 1969 0.28
I tried something like this:
import itertools
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

my_list = df.groupby('Year')['Feature_Vector'].apply(list)
total_list = []
for lists in my_list:
    i = 0
    results = []
    for x in itertools.combinations(lists, 2):
        vec1, vec2 = np.array(x[0]), np.array(x[1])
        # keep only the positions where the two vectors are not both 0
        keepers = np.where(np.logical_not((np.vstack((vec1, vec2)) == 0).all(axis=0)))
        vecx = vec1[keepers].reshape(1, -1)
        vecy = vec2[keepers].reshape(1, -1)
        try:
            score = pairwise_distances(vecx, vecy, metric = "hamming")
            print(score)
        except:
            score = 0
        results.append(score)
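The function pairwise_distances can take a matrix, so it may be easier to supply one year's features as a matrix, get back the pairwise distance matrix, and then subset only the comparisons we need. For example, with a dataset like yours: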
import numpy as np
import pandas as pd

df = pd.DataFrame({'Bacteria':['XYRT23','XXQY12','RTy11R']*2,
                   'Year':np.repeat(['1968','1969'],3),
                   'Feature_Vector':list(np.random.binomial(1,0.5,(6,12)))})
type(df['Feature_Vector'][0])
numpy.ndarray
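To make the "full matrix, then subset" idea concrete, here is a minimal sketch using the three 1968 vectors from the question. It computes the Hamming distance (the fraction of positions that differ) with plain NumPy so the arithmetic is visible; I would expect pairwise_distances(X, metric="hamming") to return the same matrix. The variable names are only for illustration.

```python
import numpy as np

# The three 1968 feature vectors from the question (illustrative names).
names = ['XYRT23', 'XXQY12', 'RTy11R']
X = np.array([
    [0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1],
])

# Compare every row with every other row via broadcasting, then average
# the mismatches over the 12 positions: a symmetric 3x3 matrix, zero diagonal.
dm = (X[:, None, :] != X[None, :, :]).mean(axis=2)

# The strict upper triangle holds each unordered pair exactly once.
rows, cols = np.triu_indices(len(names), k=1)
pairs = [(names[i], names[j], dm[i, j]) for i, j in zip(rows, cols)]

for a, b, d in pairs:
    print(a, b, round(d, 4))
```

For these vectors the three pairs differ in 2, 6 and 4 of the 12 positions, so the distances come out as 2/12, 6/12 and 4/12.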
Define a pairwise function that takes the feature column and the row names:
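def pwdist(features , names):
    # full square matrix of Hamming distances between all rows
    dm = pairwise_distances(features.to_list(),metric="hamming")
    m,n = dm.shape
    # mask the diagonal and lower triangle so each pair is kept only once
    dm[:] = np.where(np.arange(m)[:,None] >= np.arange(n),np.nan,dm)
    dm = pd.DataFrame(dm,index = names,columns = names)
    # long format; drop the masked (NaN) entries
    out = dm.stack().dropna().reset_index()
    out.columns = ['Bacteria1','Bacteria2','distance']
    return out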
Use groupby and apply the function:
df.groupby('Year').apply(lambda x: pwdist(x.Feature_Vector,x.Bacteria.values))
This gives us something like:
        Bacteria1 Bacteria2  distance
Year
1968 0     XYRT23    XXQY12  0.333333
     1     XYRT23    RTy11R  0.250000
     2     XXQY12    RTy11R  0.416667
1969 0     XYRT23    XXQY12  0.500000
     1     XYRT23    RTy11R  0.333333
     2     XXQY12    RTy11R  0.166667
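If you would rather have the flat Pair / Year / HammingDistance layout from the question, one possible sketch is below. It uses the actual vectors from the question and only itertools, NumPy and pandas; hamming_by_year is a hypothetical helper name, not part of any library. Note that it compares all 12 positions and does not drop the positions where both vectors are 0, as the attempt in the question does.

```python
import itertools
import numpy as np
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    'Bacteria': ['XYRT23', 'XXQY12', 'RTy11R'] * 2,
    'Year': [1968] * 3 + [1969] * 3,
    'Feature_Vector': [
        np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1]),
        np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1]),
        np.array([1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1]),
        np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1]),
        np.array([0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1]),
        np.array([1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]),
    ],
})

def hamming_by_year(frame):
    """Hypothetical helper: one row per unordered pair within each year."""
    rows = []
    for year, grp in frame.groupby('Year'):
        records = zip(grp['Bacteria'], grp['Feature_Vector'])
        for (n1, v1), (n2, v2) in itertools.combinations(records, 2):
            # Hamming distance as the fraction of positions that differ
            dist = np.mean(np.asarray(v1) != np.asarray(v2))
            rows.append({'Pair': f'{n1} - {n2}',
                         'Year': year,
                         'HammingDistance': dist})
    return pd.DataFrame(rows)

result = hamming_by_year(df)
print(result)
```

This loops over pairs in Python, so it is slower than the matrix approach for large groups, but it needs no reshaping afterwards.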