Trying to make my own implementation for the mode of a data set in Python

Question

I'm fully aware of Counter.most_common, but using it feels like cheating to me. I want to do it myself.

Here is my function.

    def mode(self):
        unq = []
        m = 0
        for i in self.arrData:
            if i not in unq:
                unq.append(i)
        for i in unq:
            count = self.arrData.count(i)
            if count > m:
                m = i
        return m

With this test data:

34.9, 35.0, 35.2, 35.4, 35.8, 36.0, 36.1, 36.2, 36.3, 36.4, 36.4, 36.4, 36.4, 36.5, 36.6, 36.7, 36.7, 36.8, 36.8, 37.0, 37.2, 37.3, 37.9, 38.2, 38.3, 38.3, 38.4, 38.8, 39.0, 39.4

I keep getting the first element back as m.

Answer 1

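You need to maintain two variables: the current mode and the count of the current mode. Right now you are comparing the count against the mode itself, when you should be comparing the count against the mode's count.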

    def mode(self):
        uniq = set() # set is better than lists for this
        mode = None
        mode_count = 0
        for i in self.arrData:
            uniq.add(i) # don't need to check for membership with sets
        for i in uniq:
            i_count = self.arrData.count(i)
            if i_count > mode_count:
                mode = i
                mode_count = i_count
        return mode # will return None for an empty array

Doing it in a single loop (reduces the running time):

    def mode(self):
        seen = set() # set is better than lists for this; checking membership is cheaper
        mode = None
        mode_count = 0
        for i in self.arrData:
            if i in seen:
                continue
            seen.add(i)
            i_count = self.arrData.count(i)
            if i_count > mode_count:
                mode = i
                mode_count = i_count
        return mode # will return None for an empty array

But this still hides an O(n) scan inside every arrData.count() call, so the whole thing is O(n * k) for k distinct values. To avoid that:

    from collections import defaultdict  # module-level import

    def mode(self):
        value_counts = defaultdict(int)
        for i in self.arrData:
            value_counts[i] += 1
        # equivalently: value_counts = Counter(self.arrData)
        mode = None
        mode_count = 0
        for i, i_count in value_counts.items():
            if i_count > mode_count:
                mode = i
                mode_count = i_count
        return mode # will return None for an empty array
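To try the counting approach outside the class, here is a minimal sketch as a free function run on the question's test data. The name `mode_of` is made up for illustration, and resolving ties in favour of the value seen first is an assumption, not something the question specifies.

```python
from collections import defaultdict

def mode_of(values):
    """Return the most frequent value, or None for an empty input."""
    counts = defaultdict(int)
    for v in values:
        counts[v] += 1  # single pass, no repeated list.count() scans
    if not counts:
        return None
    # Rank the distinct values by their counts. Dicts keep insertion order,
    # so a tie resolves to the value that appeared first in the input.
    return max(counts, key=counts.get)

data = [34.9, 35.0, 35.2, 35.4, 35.8, 36.0, 36.1, 36.2, 36.3, 36.4,
        36.4, 36.4, 36.4, 36.5, 36.6, 36.7, 36.7, 36.8, 36.8, 37.0,
        37.2, 37.3, 37.9, 38.2, 38.3, 38.3, 38.4, 38.8, 39.0, 39.4]

print(mode_of(data))  # 36.4 (it occurs four times)
```

Building the counts is O(n) and the final `max` is O(k) over the distinct values.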

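Alternatively, use scipy.stats.mode (see Most efficient way to find mode in numpy array). Note that if your data is continuous (as is often the case with floats), you probably want some kind of KDE rather than a mode. Otherwise you are implicitly using the precision of the data as the size of the quantization bins, when a different bin size/bandwidth might be more sensible for your data.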
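To make the bin-size point concrete, here is a rough sketch that picks the most populated bin for an explicitly chosen width. It is a crude histogram, not a KDE. The function name `binned_mode` and the width of 1.0 used below are assumptions for illustration only.

```python
import math
from collections import Counter

def binned_mode(values, bin_width):
    """Return the centre of the most populated bin, or None for an empty input."""
    if not values:
        return None
    # Map each value to the index of the half-open bin [k * w, (k + 1) * w).
    bins = Counter(math.floor(v / bin_width) for v in values)
    # Pick the fullest bin. Ties go to the bin that was encountered first.
    index = max(bins, key=bins.get)
    return (index + 0.5) * bin_width

data = [34.9, 35.0, 35.2, 35.4, 35.8, 36.0, 36.1, 36.2, 36.3, 36.4,
        36.4, 36.4, 36.4, 36.5, 36.6, 36.7, 36.7, 36.8, 36.8, 37.0,
        37.2, 37.3, 37.9, 38.2, 38.3, 38.3, 38.4, 38.8, 39.0, 39.4]

print(binned_mode(data, 1.0))  # 36.5, the centre of the bin [36, 37)
```

With a width of 1.0 the bin [36, 37) holds 14 of the 30 values, so the answer depends on the chosen width rather than on the exact repeat of 36.4.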

Answer 2

You are storing the most common value in m, but not its count. You can fix it with this code:

    def mode(self):
        unq = []
        m = 0
        c = 0
        for i in self.arrData:
            if i not in unq:
                unq.append(i)
        for i in unq:
            count = self.arrData.count(i)
            if count > c:
                m = i
                c = count
        return m
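As a quick check of why the original returned the first element, here is a sketch that runs both versions as free functions on the question's data. The names `buggy_mode` and `fixed_mode` are made up for illustration, and `dict.fromkeys` is swapped in for the `unq` list to collect the distinct values in order.

```python
def buggy_mode(values):
    m = 0
    for v in dict.fromkeys(values):   # distinct values, original order
        if values.count(v) > m:       # compares a count against a data value
            m = v
    return m

def fixed_mode(values):
    m = 0
    c = 0
    for v in dict.fromkeys(values):
        count = values.count(v)
        if count > c:                 # compares a count against the best count so far
            m = v
            c = count
    return m

data = [34.9, 35.0, 35.2, 35.4, 35.8, 36.0, 36.1, 36.2, 36.3, 36.4,
        36.4, 36.4, 36.4, 36.5, 36.6, 36.7, 36.7, 36.8, 36.8, 37.0,
        37.2, 37.3, 37.9, 38.2, 38.3, 38.3, 38.4, 38.8, 39.0, 39.4]

print(buggy_mode(data))  # 34.9
print(fixed_mode(data))  # 36.4
```

In the buggy version the first count (1) beats the initial 0, so m becomes 34.9. After that no count can exceed 34.9, so m never changes again.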

python

data-science