Append values to the new columns in the CSV
I have two CSVs, one with master data and the other with component data; the master data has two rows and two columns, while the component data has five rows and two columns.
I am trying to find the cosine similarity between them after tokenization, stemming and lemmatization, and then append the similarity index to a new column. I am unable to append the resulting values to the dataframe, which then needs to be converted to a CSV.
My approach:
```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from collections import Counter
import pandas as pd

portStemmer = PorterStemmer()
wordNetLemmatizer = WordNetLemmatizer()
fields = ['Sentences']
cosineSimilarityList = []

def fetchLemmantizedWords():
    eliminatePunctuation = re.sub('[^a-zA-Z]', ' ', value)
    convertLowerCase = eliminatePunctuation.lower()
    tokenizeData = convertLowerCase.split()
    eliminateStopWords = [word for word in tokenizeData if not word in set(stopwords.words('english'))]
    stemWords = list(set([portStemmer.stem(value) for value in eliminateStopWords]))
    wordLemmatization = [wordNetLemmatizer.lemmatize(x) for x in stemWords]
    return wordLemmatization

def fetchCosine(eachMasterData, eachComponentData):
    masterDataValues = Counter(eachMasterData)
    componentDataValues = Counter(eachComponentData)
    bagOfWords = list(masterDataValues.keys() | componentDataValues.keys())
    masterDataVector = [masterDataValues.get(bagOfWords, 0) for bagOfWords in bagOfWords]
    componentDataVector = [componentDataValues.get(bagOfWords, 0) for bagOfWords in bagOfWords]
    masterDataLength = sum(contractElement*contractElement for contractElement in masterDataVector) ** 0.5
    componentDataLength = sum(questionElement*questionElement for questionElement in componentDataVector) ** 0.5
    dotProduct = sum(contractElement*questionElement for contractElement, questionElement in zip(masterDataVector, componentDataVector))
    cosine = int((dotProduct / (masterDataLength * componentDataLength))*100)
    return cosine
```
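As a sanity check, the counting-based cosine in `fetchCosine` can be exercised on small token lists (a standalone sketch of the same formula, with a hypothetical name `cosine_pct` and made-up inputs):

```python
from collections import Counter

def cosine_pct(tokens_a, tokens_b):
    # Bag-of-words counts for each token list
    a, b = Counter(tokens_a), Counter(tokens_b)
    vocab = a.keys() | b.keys()
    # Dot product and Euclidean norms over the shared vocabulary
    dot = sum(a.get(w, 0) * b.get(w, 0) for w in vocab)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return int(dot / (norm_a * norm_b) * 100)

print(cosine_pct(['emma', 'write', 'letter'], ['emma', 'write', 'letter']))  # identical lists -> 100
print(cosine_pct(['emma', 'write', 'letter'], ['sleep', 'early', 'night']))  # disjoint lists -> 0
```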
```python
masterData = pd.read_csv('C:\Similarity\MasterData.csv', skipinitialspace=True)
componentData = pd.read_csv('C:\Similarity\ComponentData.csv', skipinitialspace=True)
for value in masterData['Sentences']:
    eachMasterData = fetchLemmantizedWords()
    for value in componentData['Sentences']:
        eachComponentData = fetchLemmantizedWords()
        cosineSimilarity = fetchCosine(eachMasterData, eachComponentData)
        cosineSimilarityList.append(cosineSimilarity)
for value in cosineSimilarityList:
    componentData = componentData.append(pd.DataFrame(cosineSimilarityList, columns=['Cosine Similarity']), ignore_index=True)
    #componentData['Cosine Similarity'] = value
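A note on the final loop: `DataFrame.append` adds rows, not columns, which is why the similarity values never line up. Also, the nested loops produce 2 × 5 = 10 values (one per master/component pair), so they cannot fill a single 5-row column directly and first need reshaping, e.g. into one column per master sentence. When there is exactly one value per component row, assignment is straightforward (a minimal sketch with made-up values and hypothetical names `df`/`sims`):

```python
import pandas as pd

df = pd.DataFrame({'Sentences': ['a', 'b', 'c']})
sims = [100, 42, 7]  # one value per row, same length as the frame
df['Cosine Similarity'] = sims  # assigns a new column, no append needed
print(df)
```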
Expected output after converting the df to a CSV.
I am facing issues appending the values to the dataframe; please help me resolve this. Thanks.
Here is what I came up with:
Sample setup
```python
csv_master_data = \
"""
SI.No;Sentences
1;Emma is writing a letter.
2;We wake up early in the morning.
"""

csv_component_data = \
"""
SI.No;Sentences
1;Emma is writing a letter.
2;We wake up early in the morning.
3;Did Emma Write a letter?
4;We sleep early at night.
5;Emma wrote a letter.
"""

import pandas as pd
from io import StringIO

df_md = pd.read_csv(StringIO(csv_master_data), delimiter=';')
df_cd = pd.read_csv(StringIO(csv_component_data), delimiter=';')
```
We end up with 2 dataframes (showing df_cd):
|   | SI.No | Sentences |
|---|-------|-----------|
| 0 | 1 | Emma is writing a letter. |
| 1 | 2 | We wake up early in the morning. |
| 2 | 3 | Did Emma Write a letter? |
| 3 | 4 | We sleep early at night. |
| 4 | 5 | Emma wrote a letter. |
I replaced the 2 functions you used with the following dummy functions:

```python
import random

def fetchLemmantizedWords(words):
    return [random.randint(1,30) for x in words]

def fetchCosine(lem_md, lem_cd):
    return 100 if len(lem_md) == len(lem_cd) else random.randint(0,100)
```
Processing the data
First, we apply the fetchLemmantizedWords function on each dataframe. The regex replacement, lowercasing and splitting of the sentences are done in the apply call, not inside the function itself.
By lowercasing the sentence first, we can simplify the regex to consider only lowercase letters.
```python
import re

for df in (df_md, df_cd):
    # str.replace does literal replacement only, so use re.sub for the regex
    df['lem'] = df.apply(lambda x: fetchLemmantizedWords(re.sub(r'[^a-z]', ' ',
                                                                x.Sentences.lower())
                                                         .split()),
                         result_type='reduce',
                         axis=1)
```
Result for df_cd:
|   | SI.No | Sentences | lem |
|---|-------|-----------|-----|
| 0 | 1 | Emma is writing a letter. | [29, 5, 4, 9, 28] |
| 1 | 2 | We wake up early in the morning. | [16, 8, 21, 14, 13, 4, 6] |
| 2 | 3 | Did Emma Write a letter? | [30, 9, 23, 16, 5] |
| 3 | 4 | We sleep early at night. | [8, 25, 24, 7, 3] |
| 4 | 5 | Emma wrote a letter. | [30, 30, 15, 7] |
Next, we use a cross join to make a dataframe with all possible combinations of the md and cd data.
```python
df_merged = pd.merge(df_md[['SI.No', 'lem']],
                     df_cd[['SI.No', 'lem']],
                     how='cross',
                     suffixes=('_md','_cd')
                     )
```
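As an aside, `how='cross'` requires pandas ≥ 1.2; on older versions the same cross join can be emulated with a constant helper key (a sketch with hypothetical frames `df_a`/`df_b` and a hypothetical helper column `_key`):

```python
import pandas as pd

df_a = pd.DataFrame({'SI.No': [1, 2]})
df_b = pd.DataFrame({'SI.No': [1, 2, 3]})

# Merge on a constant key so every row of df_a pairs with every row of df_b,
# then drop the helper column.
cross = (df_a.assign(_key=1)
             .merge(df_b.assign(_key=1), on='_key', suffixes=('_md', '_cd'))
             .drop(columns='_key'))
print(len(cross))  # 2 * 3 = 6 rows
```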
Contents of df_merged:
|   | SI.No_md | lem_md | SI.No_cd | lem_cd |
|---|----------|--------|----------|--------|
| 0 | 1 | [14, 22, 9, 21, 4] | 1 | [3, 4, 8, 17, 2] |
| 1 | 1 | [14, 22, 9, 21, 4] | 2 | [29, 3, 10, 2, 19, 18, 21] |
| 2 | 1 | [14, 22, 9, 21, 4] | 3 | [20, 22, 29, 4, 3] |
| 3 | 1 | [14, 22, 9, 21, 4] | 4 | [17, 7, 1, 27, 19] |
| 4 | 1 | [14, 22, 9, 21, 4] | 5 | [17, 5, 3, 29] |
| 5 | 2 | [12, 30, 10, 11, 7, 11, 8] | 1 | [3, 4, 8, 17, 2] |
| 6 | 2 | [12, 30, 10, 11, 7, 11, 8] | 2 | [29, 3, 10, 2, 19, 18, 21] |
| 7 | 2 | [12, 30, 10, 11, 7, 11, 8] | 3 | [20, 22, 29, 4, 3] |
| 8 | 2 | [12, 30, 10, 11, 7, 11, 8] | 4 | [17, 7, 1, 27, 19] |
| 9 | 2 | [12, 30, 10, 11, 7, 11, 8] | 5 | [17, 5, 3, 29] |
Next, we calculate the cosine values:

```python
df_merged['cosine'] = df_merged.apply(lambda x: fetchCosine(x.lem_md,
                                                            x.lem_cd),
                                      axis=1)
```
In the final step, we pivot the data and merge the original df_cd with the calculated results:
```python
pd.merge(df_cd.drop(columns='lem').set_index('SI.No'),
         df_merged.pivot_table(index='SI.No_cd',
                               columns='SI.No_md').droplevel(0, axis=1),
         how='inner',
         left_index=True,
         right_index=True)
```
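Since the stated goal is a CSV file, the merged result can then be written out with `DataFrame.to_csv` (a sketch with a made-up one-row frame shaped like the result above; the name `result` and the output buffer are assumptions, in practice you would pass a file path):

```python
import pandas as pd
from io import StringIO

# Hypothetical frame shaped like the merge result: SI.No index,
# Sentences column, one similarity column per master sentence.
result = pd.DataFrame({'Sentences': ['Emma wrote a letter.'],
                       1: [35],
                       2: [9]},
                      index=pd.Index([5], name='SI.No'))

buf = StringIO()
result.to_csv(buf)  # the named index is written as the SI.No column
csv_text = buf.getvalue()
print(csv_text)
```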
Result (again, these are dummy calculations):
| SI.No | Sentences | 1 | 2 |
|-------|-----------|---|---|
| 1 | Emma is writing a letter. | 100 | 64 |
| 2 | We wake up early in the morning. | 63 | 100 |
| 3 | Did Emma Write a letter? | 100 | 5 |
| 4 | We sleep early at night. | 100 | 17 |
| 5 | Emma wrote a letter. | 35 | 9 |