Trying to write my own implementation of the mode of a data set in Python
I'm fully aware of Counter.most_common, but using it feels like cheating to me. I want to do this myself.
Here is my function:
    def mode(self):
        unq = []
        m = 0
        for i in self.arrData:
            if i not in unq:
                unq.append(i)
        for i in unq:
            count = self.arrData.count(i)
            if count > m:
                m = i
        return m
With this test data:
34.9, 35.0, 35.2, 35.4, 35.8, 36.0, 36.1, 36.2, 36.3, 36.4, 36.4, 36.4, 36.4, 36.5, 36.6, 36.7, 36.7, 36.8, 36.8, 37.0, 37.2, 37.3, 37.9, 38.2, 38.3, 38.3, 38.4, 38.8, 39.0, 39.4
I keep getting the first element back as m.
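You need to maintain two variables: the current mode and the current mode's count. Right now you compare the count against the mode itself, when you should compare it against the mode's count. On the first iteration, a count of 1 is greater than 0, so m becomes 34.9; after that no count ever exceeds 34.9, so m never changes.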
    def mode(self):
        uniq = set()  # set is better than lists for this
        mode = None
        mode_count = 0
        for i in self.arrData:
            uniq.add(i)  # don't need to check for membership with sets
        for i in uniq:
            i_count = self.arrData.count(i)
            if i_count > mode_count:
                mode = i
                mode_count = i_count
        return mode  # will return None for an empty array
To do it in a single loop (which cuts the run time):
    def mode(self):
        seen = set()  # set is better than lists for this; checking membership is cheaper
        mode = None
        mode_count = 0
        for i in self.arrData:
            if i in seen:
                continue
            seen.add(i)
            i_count = self.arrData.count(i)
            if i_count > mode_count:
                mode = i
                mode_count = i_count
        return mode  # will return None for an empty array
But this still hides an O(n) scan inside arrData.count(), which runs once per distinct value. To avoid that, count everything in one pass:
    from collections import defaultdict

    def mode(self):
        value_counts = defaultdict(int)  # missing keys start at 0
        for i in self.arrData:
            value_counts[i] += 1
        # equivalently: value_counts = Counter(self.arrData)
        mode = None
        mode_count = 0
        for i, i_count in value_counts.items():
            if i_count > mode_count:
                mode = i
                mode_count = i_count
        return mode  # will return None for an empty array
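As a quick check of the counting approach against the test data from the question, here is a minimal standalone sketch. The free function `find_mode` and the variable `test_data` are names made up for this illustration (the original is a method on a class that isn't shown), and it uses a plain dict so no imports are needed.

```python
def find_mode(values):
    """Return the most frequent value, or None for an empty input."""
    value_counts = {}
    for v in values:
        # dict.get supplies 0 the first time a value is seen
        value_counts[v] = value_counts.get(v, 0) + 1

    mode = None
    mode_count = 0
    for v, v_count in value_counts.items():
        # strict '>' means the first value encountered wins a tie
        if v_count > mode_count:
            mode = v
            mode_count = v_count
    return mode


test_data = [
    34.9, 35.0, 35.2, 35.4, 35.8, 36.0, 36.1, 36.2, 36.3, 36.4,
    36.4, 36.4, 36.4, 36.5, 36.6, 36.7, 36.7, 36.8, 36.8, 37.0,
    37.2, 37.3, 37.9, 38.2, 38.3, 38.3, 38.4, 38.8, 39.0, 39.4,
]

print(find_mode(test_data))  # 36.4, which occurs four times
```

Note that when several values share the highest count, this returns only the first one it encountered; whether that is the right behavior depends on what you need.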
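Alternatively, use scipy.stats.mode (see "Most efficient way to find mode in numpy array"). Note that if your data is continuous (as floats usually are), you probably want some kind of KDE rather than a mode. Otherwise you are implicitly using the precision of your data as the size of the quantization bin, when a different bin size/bandwidth may suit your data better.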
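To make the point about bin size concrete, here is a rough sketch that groups values into fixed-width bins and reports the fullest bin. This is a plain histogram, not a KDE, and it is only a stand-in to show how the choice of bin width changes the answer. The function name `binned_mode` and the bin width of 1.0 are assumptions for illustration.

```python
import math
from collections import Counter


def binned_mode(values, bin_width):
    """Return (low, high) bounds of the fullest bin, or None if empty.

    Each value v falls into the half-open bin [k * w, (k + 1) * w),
    where k = floor(v / w) and w = bin_width.
    """
    if not values:
        return None
    bin_counts = Counter(math.floor(v / bin_width) for v in values)
    # max() returns the first bin encountered among those tied for the top
    best_bin, _ = max(bin_counts.items(), key=lambda item: item[1])
    return (best_bin * bin_width, (best_bin + 1) * bin_width)


data = [
    34.9, 35.0, 35.2, 35.4, 35.8, 36.0, 36.1, 36.2, 36.3, 36.4,
    36.4, 36.4, 36.4, 36.5, 36.6, 36.7, 36.7, 36.8, 36.8, 37.0,
    37.2, 37.3, 37.9, 38.2, 38.3, 38.3, 38.4, 38.8, 39.0, 39.4,
]

# With 1.0-wide bins, 14 of the 30 readings fall in [36.0, 37.0)
print(binned_mode(data, 1.0))  # (36.0, 37.0)
```

With the raw values, the mode is the single reading 36.4; with wider bins, the answer becomes a range where the readings cluster, which is often the more meaningful summary for measured data.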
You are storing the most common value in m, not its count.
You can fix it with this code:
    def mode(self):
        unq = []
        m = 0  # most common value found so far
        c = 0  # how many times m occurs
        for i in self.arrData:
            if i not in unq:
                unq.append(i)
        for i in unq:
            count = self.arrData.count(i)
            if count > c:
                m = i
                c = count
        return m