Graphlab: How to avoid manually duplicating functions that differ only in a string variable?
I imported my dataset into an SFrame:
products = graphlab.SFrame('amazon_baby.gl')
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
I would like to do sentiment analysis on the set of words shown below:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
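Then I would like to create a new column in the products matrix for each of the selected words, where each entry is the number of times that word occurs. So I created a function for the word "awesome":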
def awesome_count(word_count):
    # word_count is the bag-of-words dictionary for a single review.
    if 'awesome' in word_count:
        return word_count['awesome']
    else:
        return 0
products['awesome'] = products['word_count'].apply(awesome_count)
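So far so good, but I would need to write a similar function by hand for every other selected word, e.g. great_count, and so on. How can I avoid this manual effort and write cleaner code?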
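I think the SFrame.unpack command should do the trick. In fact, the limit parameter accepts your list of selected words and keeps only those results, so that part is greatly simplified.
I don't know exactly what is in your review data, so I made a toy example: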
# Create the data and convert to bag-of-words.
import graphlab
products = graphlab.SFrame({'review': ['this book is awesome',
                                       'I hate this book']})
products['word_count'] = \
    graphlab.text_analytics.count_words(products['review'])
# Unpack the bag-of-words into separate columns.
selected_words = ['awesome', 'hate']
products2 = products.unpack('word_count', limit=selected_words)
# Fill in zeros for the missing values.
for word in selected_words:
    col_name = 'word_count.{}'.format(word)
    products2[col_name] = products2[col_name].fillna(value=0)
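To make the expected result concrete without a GraphLab installation, the sketch below mimics the same two steps (keep only the selected keys, then fill in zeros) on plain Python dictionaries. The `unpack_counts` helper is a hypothetical stand-in written for illustration and is not part of GraphLab Create; lowercasing and splitting on whitespace only approximates what `count_words` does.

```python
from collections import Counter

def unpack_counts(word_counts, selected_words, prefix='word_count'):
    """Hypothetical stand-in for unpack(limit=...) followed by fillna(0)."""
    rows = []
    for wc in word_counts:
        # One output column per selected word; a missing word counts as 0.
        rows.append({'{}.{}'.format(prefix, w): wc.get(w, 0)
                     for w in selected_words})
    return rows

reviews = ['this book is awesome', 'I hate this book']
# Rough bag-of-words: lowercase and split on whitespace (an approximation).
word_counts = [Counter(r.lower().split()) for r in reviews]
rows = unpack_counts(word_counts, ['awesome', 'hate'])
print(rows)
```

Each row ends up with one `word_count.<word>` entry per selected word, which matches the column naming used in the loop above.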
I should also point out that GraphLab Create does have its own sentiment analysis toolkit, which is worth checking out.
I actually found an easier way to do this:
def wordCount_select(wc, selectedWord):
    if selectedWord in wc:
        return wc[selectedWord]
    else:
        return 0
for word in selected_words:
    products[word] = products['word_count'].apply(lambda wc: wordCount_select(wc, word))
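One caveat with a lambda inside a loop: Python looks up `word` when the lambda is called, not when it is defined, so if the function were only evaluated after the loop had finished, every column would use the last word. I am not certain when GraphLab evaluates the function passed to `apply`, so here is a sketch that sidesteps the question by binding the word through a factory function. It runs on plain dictionaries rather than an SFrame, and `make_counter` is a name made up for illustration.

```python
def make_counter(selected_word):
    """Return a function that looks up one fixed word in a word-count dict."""
    def count(wc):
        # dict.get returns 0 when the word is absent.
        return wc.get(selected_word, 0)
    return count

selected_words = ['awesome', 'hate']
word_counts = [{'this': 1, 'book': 1, 'is': 1, 'awesome': 1},
               {'i': 1, 'hate': 1, 'this': 1, 'book': 1}]

# Build every counter first, then apply them, to show each kept its own word.
counters = {word: make_counter(word) for word in selected_words}
columns = {word: [f(wc) for wc in word_counts] for word, f in counters.items()}
print(columns)
```

With an SFrame, the loop body would presumably become `products[word] = products['word_count'].apply(make_counter(word))`, though I have not run that against GraphLab.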