How to count words and their associated groups?
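I want to count how many times a particular topic occurs in a very long list of words. Currently I have a dictionary of dictionaries, where the outer keys are topics and the inner keys are that topic's keywords.
I'm trying to count keyword occurrences efficiently while maintaining a running total of occurrences for the corresponding topic.
Ultimately, I want to save the output for multiple texts. Below is an example of what I have implemented so far. The problems are that it is very slow and that it does not store the keyword counts in the output DataFrame. Is there an alternative that solves these problems?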
import pandas as pd
topics = {
    "mathematics": {
        "analysis": 0,
        "algebra": 0,
        "logic": 0
    },
    "philosophy": {
        "ethics": 0,
        "metaphysics": 0,
        "epistemology": 0
    }
}
texts = {
    "text_a": [
        "the", "major", "areas", "of", "study", "in", "mathematics", "are",
        "analysis", "algebra", "and", "logic", "in", "philosophy", "they",
        "are", "ethics", "metaphysics", "and", "epistemology"
    ],
    "text_b": [
        "logic", "is", "studied", "both", "in", "mathematics", "and",
        "philosophy"
    ]
}
topics_by_text = pd.DataFrame()
for title, text in texts.items():
    topic_count = {}
    for topic, sub_dict in topics.items():
        curr_topic_counter = 0
        for keyword, count in sub_dict.items():
            keyword_occurrences = text.count(keyword)
            topics[topic][keyword] = keyword_occurrences
            curr_topic_counter += keyword_occurrences
        topic_count[topic] = curr_topic_counter
    topics_by_text[title] = pd.Series(topic_count)
print(topics_by_text)
I'm not sure about the speed, but the following code stores the keyword counts in a tidy MultiIndexed way.
from collections import Counter

# Returns a dictionary of per-keyword counts, plus their total under 'count'
def CountFrequency(my_list, keyword):
    # Count every word in a single pass over the list
    freq = Counter(my_list)
    # Keep only this topic's keywords; a Counter returns 0 for missing words
    dict_ = {your_key: freq[your_key] for your_key in keyword}
    dict_['count'] = sum(dict_.values())
    return dict_
# Calculates count
output = {}
for key, value in texts.items():
    for topic, keywords in topics.items():
        # Create the per-topic dictionary on first use, then store this text's counts
        output.setdefault(topic, {})[key] = CountFrequency(value, keywords)
# To DataFrame
dict_of_df = {k: pd.DataFrame(v) for k,v in output.items()}
df = pd.concat(dict_of_df, axis=0)
df.T
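If speed turns out to matter, one possible variation is to count each text once with collections.Counter and look up only the keywords, instead of scanning the list once per keyword with list.count. The sketch below is only an illustration under that assumption: count_keywords, keyword_counts and topic_counts are hypothetical names I made up, and the sample data is a shortened copy of the data in the question.

```python
from collections import Counter

import pandas as pd

# Shortened copy of the question's data, so the sketch runs on its own
topics = {
    "mathematics": {"analysis": 0, "algebra": 0, "logic": 0},
    "philosophy": {"ethics": 0, "metaphysics": 0, "epistemology": 0},
}
texts = {
    "text_a": ["the", "major", "areas", "of", "study", "in", "mathematics",
               "are", "analysis", "algebra", "and", "logic", "in",
               "philosophy", "they", "are", "ethics", "metaphysics", "and",
               "epistemology"],
    "text_b": ["logic", "is", "studied", "both", "in", "mathematics", "and",
               "philosophy"],
}


def count_keywords(texts, topics):
    """Hypothetical helper: (topic, keyword) rows, one column per text."""
    index = pd.MultiIndex.from_tuples(
        [(topic, keyword) for topic, keywords in topics.items()
         for keyword in keywords],
        names=["topic", "keyword"],
    )
    columns = {}
    for title, words in texts.items():
        freq = Counter(words)  # one pass over the text
        # A Counter returns 0 for words that never occurred
        columns[title] = [freq[keyword] for _, keyword in index]
    return pd.DataFrame(columns, index=index)


keyword_counts = count_keywords(texts, topics)
# Per-topic totals are the sum of the keyword rows within each topic
topic_counts = keyword_counts.groupby(level="topic").sum()
print(keyword_counts)
print(topic_counts)
```

With the sample data, topic_counts reports 3 for both topics in text_a, and 1 for mathematics and 0 for philosophy in text_b. Either frame can then be saved with to_csv if the output for many texts needs to be kept.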