Calculate the frequency of each value in a list, in descending order, and the associated percentage of higher values

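I am trying to write Python code that, for a given list of values (y), computes in descending order the frequency of each y value and the associated percentage of samples (yi) with a larger y value, taking the frequency into account.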

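Thanks a lot! This is the Python code I wrote using NumPy, but I am getting errors when calculating the percentage, and I want the frequencies to stay consistent with the new array of unique y values (arr).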

# Permeability values (mD)
y = [27.10, 23.02, 18.26, 17.46, 16.88, 15.75, 15.21, 12.65, 12.65, 12.65, 12.65,  14.93, 13.88, 13.53, 13.31, 13.27, 12.65, 12.41, 11.97, 11.93, 11.84, 11.82, 27.10, 27.10, 27.10, 11.12, 11.10, 10.65, 10.54, 10.29, 9.98, 9.19, 9.03, 8.56, 8.28, 8.21, 9.98, 9.98, 11.97, 11.97, 11.97, 4.68, 4.37, 3.82, 3.44, 3.38, 3.33, 3.27, 3.22, 2.52, 2.38, 1.91, 1.89, 1.87, 1.81, 1.00, 13.27, 13.27, 9.98, 13.27, 9.98, 13.27, 9.98, 13.27]

# Permeability values in descending order (y, mD)
y_sorted = sorted(y, reverse=True)

# Calculate frequency for the permeability values in descending order
y_new_sorted = np.array(y_sorted)
arr,count = np.unique(y_new_sorted,return_counts=True)
arr_sorted = sorted(arr, reverse=True)
print('Frequency= ', count)
print('Permeability values in descending order without repititions= ', arr_sorted)

# Percentage of samples with larger permeability (x, %)
vec_percent = np.vectorize(percent)
np.unique(vec_percent(y_new_sorted))
print('Percentage of samples with larger permeability= ', vec_percent)
     
**OUTPUTS**

Frequency=  [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 6 1 1 1 1 1 1 1 1 4 1 5 6 1 1 1 1
 1 1 1 1 1 1 4]

Permeability values in descending order without repititions=  [27.1, 23.02, 18.26, 17.46, 16.88, 15.75, 15.21, 14.93, 13.88, 13.53, 13.31, 13.27, 12.65, 12.41, 11.97, 11.93, 11.84, 11.82, 11.12, 11.1, 10.65, 10.54, 10.29, 9.98, 9.19, 9.03, 8.56, 8.28, 8.21, 4.68, 4.37, 3.82, 3.44, 3.38, 3.33, 3.27, 3.22, 2.52, 2.38, 1.91, 1.89, 1.87, 1.81, 1.0]

Traceback (most recent call last):
  File line 22, in <module>
    vec_percent = np.vectorize(percent)
NameError: name 'percent' is not defined

Process finished with exit code 1

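Basic

list.count(item) returns the number of times item occurs in the list. list.index(item) returns the position of the first occurrence of item, which equals the number of elements before it (because Python indexes lists from 0). Since the list is sorted in descending order, that position is exactly the number of higher values.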

y = [390, 390, 390, 390, 390, 370, 370, 350, 330, 330, 330, 330, 330, 330, 310, 310, 310, 310, 290]

def freq(item, lst):
    return lst.count(item)

def higher_perc(item, lst):
    return lst.index(item) / len(lst)

print(freq(370, y))  # 2
print(higher_perc(370, y))  # 0.2631578947368421
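Note that lst.index(item) only counts the higher values when the list is already sorted in descending order, which the raw y in the question is not. The following is a minimal sketch that sorts a copy first and cross-checks the result against a direct count. The name higher_perc_unsorted and the sample data are illustrative assumptions, not part of the original answer.

```python
def higher_perc_unsorted(item, lst):
    """Fraction of values in lst strictly greater than item (lst may be unsorted)."""
    ordered = sorted(lst, reverse=True)  # descending copy; the input list is untouched
    return ordered.index(item) / len(ordered)

# Illustrative, deliberately unsorted sample
sample = [330, 390, 370, 390, 330, 290]

result = higher_perc_unsorted(330, sample)
# Cross-check with a direct count of strictly greater values
direct = sum(v > 330 for v in sample) / len(sample)
print(result, direct)
```

Both numbers should agree, since the index of the first occurrence in the descending copy is the count of strictly larger values.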

If we want to apply this to several values, we can create a function that returns a function applying the operation, and then use map:

y = [390, 390, 390, 390, 390, 370, 370, 350, 330, 330, 330, 330, 330, 330, 310, 310, 310, 310, 290]
items = sorted(set(y), reverse=True)

def create_freq_function(lst):
    def freq(item):
        return lst.count(item)
    return freq

def create_higher_perc_function(lst):
    def higher_perc(item):
        return lst.index(item) / len(lst)
    return higher_perc

print(items)
# [390, 370, 350, 330, 310, 290]
print(list(map(create_freq_function(y), items)))
# [5, 2, 1, 6, 4, 1]
print(list(map(create_higher_perc_function(y), items)))
# [0.0, 0.2631578947368421, 0.3684210526315789, 0.42105263157894735, 0.7368421052631579, 0.9473684210526315]
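Because list.count and list.index each scan the whole list, the map-based version does one scan per unique value. As a possible alternative, here is a sketch that counts everything in one pass with collections.Counter and keeps a running total of higher samples. The function name freq_and_higher_share is an assumption made for illustration.

```python
from collections import Counter

def freq_and_higher_share(values):
    """Return (value, frequency, share of strictly higher values) rows, highest value first."""
    counts = Counter(values)   # one pass over the data
    total = len(values)
    rows = []
    higher = 0                 # samples seen so far that have a larger value
    for value in sorted(counts, reverse=True):
        rows.append((value, counts[value], higher / total))
        higher += counts[value]
    return rows

y = [390, 390, 390, 390, 390, 370, 370, 350, 330, 330, 330, 330, 330, 330, 310, 310, 310, 310, 290]
rows = freq_and_higher_share(y)
for value, n, share in rows:
    print(value, n, share)
```

This variant does not require the input to be pre-sorted, since the ordering comes from sorting the unique values.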

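NumPy

If the dataset is large, the numpy package can help. numpy.unique returns both the unique items and the number of times each one occurs, and numpy.cumsum accumulates the percentage contributed by each element.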

import numpy as np

y = np.array([390, 390, 390, 390, 390, 370, 370, 350, 330, 330, 330, 330, 330, 330, 310, 310, 310, 310, 290])

items, freqs = np.unique(y, return_counts=True)
items, freqs = items[::-1], freqs[::-1]
perc_freqs = freqs/len(y)
higher_percs = np.cumsum(perc_freqs) - perc_freqs

print(items)
# [390 370 350 330 310 290]
print(freqs)
# [5 2 1 6 4 1]
print(higher_percs)
# [0.         0.26315789 0.36842105 0.42105263 0.73684211 0.94736842]
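For reference, here is a sketch of the same idea applied to the permeability values from the question. In the question's output, count was printed in ascending order while arr_sorted was descending, so the two did not line up. Reversing both arrays together keeps them aligned. Working with integer counts before dividing, and expressing the result in percent, are choices made for this sketch rather than part of the original answer.

```python
import numpy as np

# Permeability values (mD) from the question
y = np.array([
    27.10, 23.02, 18.26, 17.46, 16.88, 15.75, 15.21, 12.65, 12.65, 12.65, 12.65,
    14.93, 13.88, 13.53, 13.31, 13.27, 12.65, 12.41, 11.97, 11.93, 11.84, 11.82,
    27.10, 27.10, 27.10, 11.12, 11.10, 10.65, 10.54, 10.29, 9.98, 9.19, 9.03,
    8.56, 8.28, 8.21, 9.98, 9.98, 11.97, 11.97, 11.97, 4.68, 4.37, 3.82, 3.44,
    3.38, 3.33, 3.27, 3.22, 2.52, 2.38, 1.91, 1.89, 1.87, 1.81, 1.00, 13.27,
    13.27, 9.98, 13.27, 9.98, 13.27, 9.98, 13.27,
])

items, freqs = np.unique(y, return_counts=True)  # ascending unique values and their counts
items, freqs = items[::-1], freqs[::-1]          # reverse both together so they stay aligned
higher_counts = np.cumsum(freqs) - freqs         # samples with a strictly larger value
higher_percs = 100 * higher_counts / len(y)      # expressed as a percentage

# Show the first few rows: value, frequency, percentage of samples with larger values
for value, n, pct in zip(items[:3], freqs[:3], higher_percs[:3]):
    print(f"{value:6.2f}  freq={n}  higher={pct:.2f}%")
```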

You can use this function to calculate the frequencies:

def frequencies(values, display_flag=True):
  # Count occurrences of each value; keys are stored as strings
  freq = {}
  for val in values:
    key = str(val)
    freq[key] = freq.get(key, 0) + 1

  # Displaying frequencies (keys sorted numerically rather than as text)
  if display_flag:
    for i in sorted(freq, key=float):
      print("Frequency of " + i + " is : " + str(freq[i]))

  return freq

You can use this function to calculate the percentages:

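def percentages(values):
  freq = frequencies(values, False)
  total = len(values)
  current = 0  # running cumulative fraction

  # Walk the values in ascending numeric order (keys are strings)
  for i in sorted(freq, key=float):
    temp = freq[i]/total
    # Cumulative share of samples up to and including this value
    print("Percentage of " + i + " is : " + str(current + temp))
    current += temp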
Note that the percentages function relies on the frequencies function.

There are two ways to do this: with a plain list, or with the more efficient numpy:

Using lists

>>> y = [390, 390, 390, 390, 390, 370, 370, 350, 330, 330, 330, 330, 330, 330, 310, 310, 310, 310, 290]
#declare a lambda function to calculate percentage and frequency
>>> freq = lambda x: y.count(x)
>>> percent = lambda z: y.index(z)/len(y)
#after this using map() and mapping over only unique values rather than all
>>> print(list(map(freq,set(y))))
[1, 5, 6, 2, 4, 1]
>>> print(list(map(percent,set(y))))
[0.9473684210526315, 0.0, 0.42105263157894735, 0.2631578947368421, 0.7368421052631579, 0.3684210526315789]
>>> set(y)
{290, 390, 330, 370, 310, 350}
#frequency and percent corresponds here to respective values

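Using NumPy

I recommend this approach because it is fast and efficient, but you will only see a benefit when you have a relatively large dataset to process.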

>>> import numpy as np
>>> y_new = np.array(y)
>>> arr,count = np.unique(y_new,return_counts=True) #very simple approach to get output
>>> count
array([1, 4, 6, 1, 2, 5])
>>> arr
array([290, 310, 330, 350, 370, 390])
#defining a vectorized percentage function that reuses the percent lambda defined previously
>>> vec_percent = np.vectorize(percent)
>>> np.unique(vec_percent(y_new))
array([0.        , 0.26315789, 0.36842105, 0.42105263, 0.73684211,
       0.94736842])
#you get your percentages
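One caveat: as far as I know, np.vectorize is a convenience wrapper that still loops in Python, and here every call runs y.index on the list. As a possible alternative, this sketch uses np.searchsorted on an ascending-sorted copy to count the strictly larger samples for every unique value at once. The variable names are illustrative assumptions.

```python
import numpy as np

y = np.array([390, 390, 390, 390, 390, 370, 370, 350, 330, 330, 330, 330, 330, 330, 310, 310, 310, 310, 290])

y_asc = np.sort(y)            # ascending copy, as required by searchsorted
values = np.unique(y)[::-1]   # unique values, highest first

# With side="right", the insertion index is the number of samples <= each value
n_le = np.searchsorted(y_asc, values, side="right")
higher_share = (len(y) - n_le) / len(y)  # share of samples strictly greater than each value

print(values)
print(higher_share)
```

Unlike the lambda-based version, this does not depend on y already being sorted in descending order.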

Now it is up to you which one to use.