Labelling for sentiment analysis with a file
I have data in these files:
- after_tokenize.xlsx
- positive.xlsx
- negative.xlsx
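I want to label the data from after_tokenize.xlsx as positive or negative. If a tokenized row contains more positive words from positive.xlsx, it should be labelled positive; if it contains more negative words from negative.xlsx, it should be labelled negative. The result should go into a column named label.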
Sample:
data | label |
---|---|
[i, like, love, hate, you] | positive |
[i, worst, hate, like, you] | negative |
import pandas as pd
import nltk
df = pd.DataFrame({'data': ['i like love hate you', 'i dont hate like you']})
pos = pd.DataFrame(data=['like', 'love'], columns=['positive'])
neg = pd.DataFrame(data=['dont', 'hate'], columns=['negative'])
df['data'] = df.apply(lambda row: nltk.word_tokenize(row['data']), axis=1)
You can use set() with the intersection operation set(...) & set(...) to get the words that appear in both lists.
Then you can use len() to count them:
len( set(['i', 'like', 'love', 'hate', 'you']) & set(['like', 'love']) )
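As a minimal standalone sketch of that idea, using the sample words from the question (the variable names here are only illustrative):

```python
# Tokens of one row and the two word lists (sample values from the question).
tokens = ['i', 'like', 'love', 'hate', 'you']
pos = ['like', 'love']
neg = ['dont', 'hate']

# Set intersection keeps only the words present in both collections.
common_pos = set(tokens) & set(pos)
common_neg = set(tokens) & set(neg)

# len() gives the number of distinct matching words.
pos_count = len(common_pos)
neg_count = len(common_neg)

# sorted() is used only to get a stable print order; sets are unordered.
print(sorted(common_pos), pos_count)  # ['like', 'love'] 2
print(sorted(common_neg), neg_count)  # ['hate'] 1
```

Because sets are unordered, the order of the common words can differ between runs.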
import pandas as pd
import nltk
df = pd.DataFrame({'data': ['i like love hate you', 'i dont hate like you']})
pos = ['like', 'love']
neg = ['dont', 'hate']
#print(df)
df['data'] = df['data'].apply(nltk.word_tokenize)
# --- get common words ---
df['pos words'] = df['data'].apply(lambda item: list(set(item) & set(pos)))
df['neg words'] = df['data'].apply(lambda item: list(set(item) & set(neg)))
# --- count common words ---
df['pos'] = df['data'].apply(lambda item: len(set(item) & set(pos)))
df['neg'] = df['data'].apply(lambda item: len(set(item) & set(neg)))
# or
df['pos'] = df['pos words'].apply(len)
df['neg'] = df['neg words'].apply(len)
# --- assign labels ---
df['label'] = '???' # default value, kept when both counts are equal
#df['label'][ df['pos'] > df['neg'] ] = 'positive'
df.loc[ (df['pos'] > df['neg']), 'label' ] = 'positive'
#df['label'][ df['pos'] < df['neg'] ] = 'negative'
df.loc[ (df['pos'] < df['neg']), 'label' ] = 'negative'
# ---
print(df)
Result:
                         data     pos words     neg words  pos  neg     label
0  [i, like, love, hate, you]  [love, like]        [hate]    2    1  positive
1  [i, dont, hate, like, you]        [like]  [hate, dont]    1    2  negative
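Note that set() counts each distinct word only once, so a row such as "love love love hate hate" would give 1 positive and 1 negative. If repeated words should count, one possible variation (not part of the approach above) is to count every token. The helper name `label_tokens` and its `default` parameter are hypothetical, chosen only for this sketch:

```python
def label_tokens(tokens, pos_words, neg_words, default='???'):
    """Label one list of tokens, counting every occurrence of a word."""
    pos_set = set(pos_words)  # sets give fast membership tests
    neg_set = set(neg_words)

    # Count each token that appears in the word lists, repeats included.
    pos_count = sum(1 for token in tokens if token in pos_set)
    neg_count = sum(1 for token in tokens if token in neg_set)

    if pos_count > neg_count:
        return 'positive'
    if neg_count > pos_count:
        return 'negative'
    return default  # tie, or no matching words at all


print(label_tokens(['i', 'like', 'love', 'hate', 'you'], ['like', 'love'], ['dont', 'hate']))
print(label_tokens(['i', 'dont', 'hate', 'like', 'you'], ['like', 'love'], ['dont', 'hate']))
```

With the DataFrame from the example it could be applied as `df['label'] = df['data'].apply(label_tokens, args=(pos, neg))`. If the word lists are kept in DataFrames as in the question, `pos['positive'].tolist()` and `neg['negative'].tolist()` turn the columns into plain lists first.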