How to plot a histogram of a column from a CSV file
Given a sample CSV file, the x-axis should contain the letters in the range a-z and A-Z, and the y-axis should plot their respective frequencies in the content column.
import pandas as pd
import numpy as np
import string
from matplotlib import pyplot as plt
plt.style.use('fivethirtyeight')
col_list = ["tweet_id","sentiment","author","content"]
df = pd.read_csv("sample.csv",usecols=col_list)
freq = (df["content"])
frequencies = {}
for sentence in freq:
    for char in sentence:
        if char in frequencies:
            frequencies[char] += 1
        else:
            frequencies[char] = 1
frequency = str(frequencies)
bins = [chr(i + ord('a')) for i in range(26)].__add__([chr(j + ord('A')) for j in range(26)])
plt.title('data')
plt.xlabel('letters')
plt.ylabel('frequencies')
plt.hist(bins,frequency,edgecolor ='black')
plt.tight_layout()
plt.show()
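Your code is already well structured. I would still suggest using plt.bar with the letters on the xticks instead of plt.hist, since that seems easier for putting letters on the x-axis. I commented out the else so that nothing other than the desired letters (a-zA-Z) is added. I also included a sorted command to give the option of ordering the bars alphabetically or by frequency count.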
Input used in sample.csv:
tweet_id sentiment author content
0 NaN NaN NaN @tiffanylue i know i was listenin to bad habit...
1 NaN NaN NaN Layin n bed with a headache ughhhh...waitin on...
2 NaN NaN NaN Funeral ceremony...gloomy friday...
3 NaN NaN NaN wants to hang out with friends SOON!
4 NaN NaN NaN @dannycastillo We want to trade with someone w...
5 NaN NaN NaN Re-pinging @ghostridahl4: why didn't you go to...
6 NaN NaN NaN I should be sleep, but im not! thinking about ...
...
...
# populate dictionary a-zA-Z with zeros
frequencies = {}
for i in range(26):
    frequencies[chr(i + ord('a'))] = 0
    frequencies[chr(i + ord('A'))] = 0

# iterate over each row of "content"
for row in df.loc[:,"content"]:
    for char in row:
        if char in frequencies:
            frequencies[char] += 1
        # uncomment to include numbers and symbols (!@#$...)
        # else:
        #     frequencies[char] = 1

# sort items from highest count to lowest
char_freq = sorted(frequencies.items(), key=lambda x: x[1], reverse=True)
# char_freq = sorted(frequencies.items(), key=lambda x: x, reverse=False)

plt.title('data')
plt.xlabel('letters')
plt.ylabel('frequencies')
plt.bar(range(len(char_freq)), [i[1] for i in char_freq], align='center')
plt.xticks(range(len(char_freq)), [i[0] for i in char_freq])
plt.tight_layout()
plt.show()
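The counting step above could also be written more compactly with the standard library. This is only a sketch, not part of the original answer: the helper name letter_frequencies and the two sample strings (taken from the untruncated rows of the input shown earlier) are assumptions for illustration, and the isinstance check is an added guard in case the content column contains missing values.

```python
from collections import Counter
import string


def letter_frequencies(texts):
    """Count ASCII letters (a-z, A-Z) across an iterable of strings.

    Hypothetical helper for illustration; not from the original answer.
    """
    counts = Counter()
    for text in texts:
        # Skip missing values such as NaN, which are floats rather than strings.
        if isinstance(text, str):
            counts.update(ch for ch in text if ch in string.ascii_letters)
    # Keep every letter, including those that never occur, in a-zA-Z order.
    return {letter: counts[letter] for letter in string.ascii_letters}


# Two rows from the sample input stand in for df["content"].
sample = [
    "Funeral ceremony...gloomy friday...",
    "wants to hang out with friends SOON!",
]
freq = letter_frequencies(sample)

# Same ordering option as in the answer: highest count first.
by_count = sorted(freq.items(), key=lambda item: item[1], reverse=True)
```

With the DataFrame from the question, letter_frequencies(df["content"]) should produce a dictionary whose keys and values can be passed to plt.bar and plt.xticks in the same way as char_freq above.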