Append values to a new column in the CSV

I have two CSVs, one with master data and the other with component data. The master data has two rows and two columns, while the component data has five rows and two columns.

I am trying to find the cosine similarity between them after tokenizing, stemming and lemmatizing, and then append the similarity index to a new column. I am unable to append the resulting values to the dataframe, which then needs to be converted back to a CSV.

My approach:

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer,WordNetLemmatizer
from collections import Counter
import pandas as pd

portStemmer=PorterStemmer()
wordNetLemmatizer = WordNetLemmatizer()
fields = ['Sentences']
cosineSimilarityList = []

def fetchLemmantizedWords(value):
    eliminatePunctuation = re.sub('[^a-zA-Z]', ' ', value)
    convertLowerCase = eliminatePunctuation.lower()
    tokenizeData = convertLowerCase.split()
    eliminateStopWords = [word for word in tokenizeData if word not in set(stopwords.words('english'))]
    stemWords = list(set(portStemmer.stem(word) for word in eliminateStopWords))
    wordLemmatization = [wordNetLemmatizer.lemmatize(x) for x in stemWords]
    return wordLemmatization

def fetchCosine(eachMasterData, eachComponentData):
    masterDataValues = Counter(eachMasterData)
    componentDataValues = Counter(eachComponentData)
    bagOfWords = list(masterDataValues.keys() | componentDataValues.keys())
    masterDataVector = [masterDataValues.get(word, 0) for word in bagOfWords]
    componentDataVector = [componentDataValues.get(word, 0) for word in bagOfWords]
    masterDataLength = sum(element * element for element in masterDataVector) ** 0.5
    componentDataLength = sum(element * element for element in componentDataVector) ** 0.5
    dotProduct = sum(m * c for m, c in zip(masterDataVector, componentDataVector))
    cosine = int((dotProduct / (masterDataLength * componentDataLength)) * 100)
    return cosine

# raw strings keep the backslashes in the Windows paths from being read as escapes
masterData = pd.read_csv(r'C:\Similarity\MasterData.csv', skipinitialspace=True)
componentData = pd.read_csv(r'C:\Similarity\ComponentData.csv', skipinitialspace=True)

for value in masterData['Sentences']:
    eachMasterData = fetchLemmantizedWords(value)
    for value in componentData['Sentences']:
        eachComponentData = fetchLemmantizedWords(value)
        cosineSimilarity = fetchCosine(eachMasterData, eachComponentData)
        cosineSimilarityList.append(cosineSimilarity)
    # this is the failing part: it appends the whole list as new rows on
    # every iteration instead of adding the values as a new column
    for value in cosineSimilarityList:
        componentData = componentData.append(pd.DataFrame(cosineSimilarityList, columns=['Cosine Similarity']), ignore_index=True)
        #componentData['Cosine Similarity'] = value

Expected output after converting the df to a CSV:

I am having trouble appending the values to the dataframe; please help me resolve this. Thanks.

Here is what I came up with:

Sample setup

csv_master_data = \
"""
SI.No;Sentences
1;Emma is writing a letter.
2;We wake up early in the morning.
"""

csv_component_data = \
"""
SI.No;Sentences
1;Emma is writing a letter.
2;We wake up early in the morning.
3;Did Emma Write a letter?
4;We sleep early at night.
5;Emma wrote a letter.
"""

import pandas as pd
from io import StringIO

df_md = pd.read_csv(StringIO(csv_master_data), delimiter=';')
df_cd = pd.read_csv(StringIO(csv_component_data), delimiter=';')

We end up with 2 dataframes (showing df_cd):

   SI.No                         Sentences
0      1         Emma is writing a letter.
1      2  We wake up early in the morning.
2      3          Did Emma Write a letter?
3      4          We sleep early at night.
4      5              Emma wrote a letter.

I replaced the 2 functions you used with the following dummy functions:

import random

def fetchLemmantizedWords(words):
    return [random.randint(1,30) for x in  words]

def fetchCosine(lem_md, lem_cd):
    return 100 if len(lem_md) == len(lem_cd) else random.randint(0,100)

Processing the data

First, we apply the fetchLemmantizedWords function to each dataframe. The regex substitution, lowercasing and splitting of the sentences happen in the apply call rather than inside the function itself.

By lowercasing the sentences first, the regex only needs to match lowercase letters.

import re

for df in (df_md, df_cd):
    # str.replace does plain substring replacement, so use re.sub for the
    # regex; it strips everything except lowercase letters before splitting
    df['lem'] = df.apply(lambda x: fetchLemmantizedWords(
                             re.sub(r'[^a-z]', ' ', x.Sentences.lower()).split()),
                         result_type='reduce',
                         axis=1)

The result for df_cd:

   SI.No                         Sentences                         lem
0      1         Emma is writing a letter.           [29, 5, 4, 9, 28]
1      2  We wake up early in the morning.  [16, 8, 21, 14, 13, 4, 6]
2      3          Did Emma Write a letter?          [30, 9, 23, 16, 5]
3      4          We sleep early at night.           [8, 25, 24, 7, 3]
4      5              Emma wrote a letter.             [30, 30, 15, 7]

Next, we use a cross join to build a dataframe containing all possible combinations of the md and cd data.

df_merged = pd.merge(df_md[['SI.No', 'lem']], 
                     df_cd[['SI.No', 'lem']], 
                     how='cross', 
                     suffixes=('_md','_cd')
                    )
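Note that how='cross' requires pandas 1.2 or newer. On older versions, the same cross join can be written with a temporary key column; a minimal sketch using small stand-in frames:

```python
import pandas as pd

# tiny stand-ins for df_md (2 rows) and df_cd (5 rows), as above
df_md = pd.DataFrame({'SI.No': [1, 2], 'lem': [[1, 2], [3, 4]]})
df_cd = pd.DataFrame({'SI.No': [1, 2, 3, 4, 5], 'lem': [[5]] * 5})

# merging on a constant key pairs every left row with every right row,
# which is exactly what how='cross' does on pandas >= 1.2
df_merged = (df_md.assign(_key=1)
             .merge(df_cd.assign(_key=1), on='_key', suffixes=('_md', '_cd'))
             .drop(columns='_key'))

print(df_merged.shape)  # → (10, 4): 2 x 5 combinations
```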

Contents of df_merged:

   SI.No_md                      lem_md  SI.No_cd                      lem_cd
0         1          [14, 22, 9, 21, 4]         1            [3, 4, 8, 17, 2]
1         1          [14, 22, 9, 21, 4]         2  [29, 3, 10, 2, 19, 18, 21]
2         1          [14, 22, 9, 21, 4]         3          [20, 22, 29, 4, 3]
3         1          [14, 22, 9, 21, 4]         4          [17, 7, 1, 27, 19]
4         1          [14, 22, 9, 21, 4]         5              [17, 5, 3, 29]
5         2  [12, 30, 10, 11, 7, 11, 8]         1            [3, 4, 8, 17, 2]
6         2  [12, 30, 10, 11, 7, 11, 8]         2  [29, 3, 10, 2, 19, 18, 21]
7         2  [12, 30, 10, 11, 7, 11, 8]         3          [20, 22, 29, 4, 3]
8         2  [12, 30, 10, 11, 7, 11, 8]         4          [17, 7, 1, 27, 19]
9         2  [12, 30, 10, 11, 7, 11, 8]         5              [17, 5, 3, 29]

Next, we calculate the cosine values:

df_merged['cosine'] = df_merged.apply(lambda x: fetchCosine(x.lem_md, 
                                                            x.lem_cd), 
                                      axis=1)

In the final step, we pivot the data and merge the original df_cd with the calculated results:

pd.merge(df_cd.drop(columns='lem').set_index('SI.No'),
         df_merged.pivot_table(index='SI.No_cd', 
                               columns='SI.No_md').droplevel(0, axis=1),
         how='inner',
         left_index=True, 
         right_index=True)
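To see what the pivot step produces on its own, here is a toy df_merged with hand-picked cosine values (not the ones above). Passing values='cosine' explicitly also avoids the extra column level that the droplevel(0, axis=1) call strips off:

```python
import pandas as pd

# toy stand-in for df_merged: 2 master x 2 component sentences
df_merged = pd.DataFrame({'SI.No_md': [1, 1, 2, 2],
                          'SI.No_cd': [1, 2, 1, 2],
                          'cosine':   [100, 63, 64, 100]})

# one row per component sentence, one column per master sentence
wide = df_merged.pivot_table(index='SI.No_cd', columns='SI.No_md',
                             values='cosine')
print(wide)
```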

Result (again, based on the dummy calculations):

SI.No  Sentences                            1    2
1      Emma is writing a letter.          100   64
2      We wake up early in the morning.    63  100
3      Did Emma Write a letter?           100    5
4      We sleep early at night.           100   17
5      Emma wrote a letter.                35    9