Looping a cosine similarity formula from one dataframe to another dataframe using pandas & BERT
I am building an NLP project that compares sentence similarity between two different dataframes. Here is a sample of the dataframes:
df = pd.DataFrame({'Element Detail':['Too many competitors in market', 'Highly skilled employees']})
df1 = pd.DataFrame({'Element Details':['Our workers have a lot of talent',
'this too is a sentence',
'this is very different',
'another sentence is this',
'not much of anything']
})
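My code is currently set up to compare the first cell in df against all the cells in df1. It then selects the highest cosine similarity score and puts it in a separate dataframe. The code is below: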
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model_name = 'bert-base-nli-mean-tokens'
model = SentenceTransformer(model_name)

# Encode every sentence in each dataframe
sentence_vecs = model.encode(df['Element Detail'])
sentence_vecs1 = model.encode(df1['Element Details'])

# Compare only the first sentence of df against all sentences of df1
new = cosine_similarity(
    [sentence_vecs[0]],
    sentence_vecs1
)

d = pd.DataFrame(new)
T = d.transpose()
# DataFrame.insert modifies T in place and returns None
T.insert(0, 'New_ID', range(1, 1 + len(T)))
Tnew = T.add_prefix('X')
Final = Tnew[Tnew.X0 == Tnew.X0.max()]
The end product is this dataframe:
XNew_ID X0
0 1 0.615005
How can I write a piece of code that iterates through the remaining elements in df and writes to the 'Final' dataframe in the same way?
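Cosine similarity works on two lists of vectors, so you can pass the full lists of embeddings as arguments and then extract the maximum similarity for each row.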
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model_name = 'bert-base-nli-mean-tokens'
model = SentenceTransformer(model_name)

# df and df1 are the dataframes defined in the question
sentence_vecs = model.encode(df['Element Detail'])
sentence_vecs1 = model.encode(df1['Element Details'])

# Full similarity matrix: one row per df sentence, one column per df1 sentence
new = cosine_similarity(
    sentence_vecs,
    sentence_vecs1
)

# Highest similarity score for each sentence in df
max_similarities = np.amax(new, axis=1)

d = pd.DataFrame(new)
T = d.transpose()  # rows are now df1 sentences, columns are df sentences
T.insert(0, 'New_ID', range(1, 1 + len(T)))
Tnew = T.add_prefix('X')
# Note: this filter only looks at X0, i.e. the first sentence of df
Final = Tnew[Tnew.X0 == Tnew.X0.max()]
Final
Output:
XNew_ID X0 X1
0 1 0.615005 0.868932
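The `Final` filter above only checks column `X0`, so it reports the df1 row that best matches the first sentence of df. If the goal is one best match per sentence of df, a possible approach is to take the row-wise `argmax` and `max` of the similarity matrix. The sketch below is only an illustration under stated assumptions: the small hand-written vectors are hypothetical stand-ins for the real `model.encode(...)` output, the cosine similarity is computed directly with NumPy (normalized dot product) instead of `sklearn`'s `cosine_similarity` so the snippet runs without the model, and the names `cosine_matrix`, `final`, `df_row`, `best_df1_row` and `score` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in embeddings. In the real pipeline these would be
# model.encode(df['Element Detail']) and model.encode(df1['Element Details']).
query_vecs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 1.0]])
candidate_vecs = np.array([[0.0, 1.0, 0.9],
                           [1.0, 0.1, 0.0],
                           [0.5, 0.5, 0.5]])


def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# One row per query sentence, one column per candidate sentence
scores = cosine_matrix(query_vecs, candidate_vecs)

# For each query row: position of the best candidate and its score
best_idx = scores.argmax(axis=1)
best_score = scores.max(axis=1)

final = pd.DataFrame({
    'df_row': np.arange(len(scores)),
    'best_df1_row': best_idx,
    'score': best_score,
})
print(final)
```

With real embeddings, the matched text could then be looked up with something like `df1['Element Details'].iloc[best_idx]`, which avoids an explicit Python loop over the rows of df.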