Computing a few difficult metrics from an integer vector in R

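For some context, I am working with sports / basketball data. The following vector is for 1 NBA game, and contains the number of points by which the home team is winning or losing at any given moment in the game.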

dput(leads_vector)
c(0, 0, 0, 0, 0, 0, 0, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, 
-2, -2, -2, -2, -2, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 4, 2, 
5, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 
10, 10, 10, 10, 10, 10, 10, 11, 11, 9, 9, 9, 9, 9, 9, 9, 9, 11, 
11, 9, 9, 9, 11, 11, 11, 11, 12, 13, 13, 13, 13, 13, 13, 15, 
14, 14, 13, 13, 13, 13, 11, 14, 14, 14, 14, 14, 14, 14, 14, 14, 
14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 16, 
16, 13, 13, 11, 11, 11, 11, 11, 9, 9, 9, 7, 9, 9, 9, 10, 10, 
11, 11, 11, 11, 11, 11, 13, 13, 13, 13, 13, 11, 11, 11, 11, 11, 
12, 13, 13, 13, 13, 13, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 
11, 11, 12, 13, 13, 13, 13, 12, 12, 12, 12, 12, 12, 12, 12, 12, 
12, 12, 12, 12, 15, 15, 15, 13, 13, 13, 13, 15, 12, 12, 12, 9, 
9, 9, 9, 9, 11, 11, 11, 11, 13, 13, 10, 10, 10, 8, 8, 8, 8, 8, 
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
8, 8, 8, 10, 8, 7, 7, 7, 7, 7, 7, 7, 7, 8, 9, 9, 9, 11, 12, 12, 
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 10, 12, 10, 12, 12, 12, 
12, 14, 14, 14, 12, 12, 12, 12, 12, 12, 12, 12, 14, 14, 14, 15, 
16, 16, 16, 16, 14, 14, 11, 11, 11, 11, 11, 11, 9, 9, 9, 9, 9, 
9, 9, 10, 11, 11, 9, 9, 9, 9, 7, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 3, 3, 3, 3, 3, 3, 3, 2, 1, 1, 1, 
3, 3, 3, 3, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6, 6, 
6, 6, 6, 7, 8, 8, 8, 8, 8, 8, 8, 8, 10, 10, 10, 8, 8, 7, 7, 7, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 11, 11, 11, 11, 
9, 9, 9, 9, 9, 9, 10, 11, 11, 11, 8, 11, 8, 10, 10, 11, 11, 11, 
11, 11, 9, 11, 11, 11, 10, 10, 10, 12, 12, 12, 12, 13, 13, 16, 
16, 16, 16, 17, 18, 19, 19, 19, 19, 19, 18, 18, 18, 20, 20, 20, 
20, 20, 20, 20, 18, 18, 18, 16, 16, 16, 13, 13, 13, 11, 10, 10, 
10, 10, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13)

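These vectors always start with 0, as the game starts tied 0-0. leads_vector[100] equals 14, which means that the home team was winning by 14 at that point in the game. Note that numbers in the vector repeat, because the score can stay the same for several consecutive plays in a basketball game.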

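The 4 metrics I'd like to compute are the biggest lead, the number of times tied, the longest run, and the number of lead changes.

The biggest lead is easy to compute: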

biggest_lead <- abs(max(leads_vector))

The number of times tied is a bit harder to compute:

times_tied <- sum(leads_vector[2:length(leads_vector)] == 0 & leads_vector[1:(length(leads_vector)-1)] != 0)

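times_tied checks for all instances in the vector where the value is 0 (the score is tied) and the previous value in the vector is not 0. This ensures that each sequence of zeros counts as the score being tied only once. Note that the opening 0-0 is not counted, since it has no non-zero predecessor.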
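To make the shifted comparison concrete, here is a minimal sketch on a small made-up vector. The name toy_leads and its values are hypothetical and are not taken from the game data above.

```r
# Hypothetical toy vector (not real game data)
toy_leads <- c(0, 0, 2, 0, 0, -1, 0, 3)

# Current entries (all but the first) versus previous entries (all but the last)
current  <- toy_leads[-1]
previous <- toy_leads[-length(toy_leads)]

# TRUE where the score has just become tied
is_new_tie <- current == 0 & previous != 0

toy_times_tied <- sum(is_new_tie)
toy_times_tied
#> [1] 2
```

The two ties counted are the ones following the 2 and the -1. The opening pair of zeros is skipped because neither has a non-zero value before it.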

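I am not sure how to compute the longest run. The longest run in the game is the largest monotonically increasing or decreasing sequence in the vector. Using just the eye test, I noticed a long run of 8 at leads_vector[38:65], where the lead goes from 3 to 11.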

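The number of lead changes is also difficult to compute. It would equal the number of times in this vector that the lead goes from positive to negative, or from negative to positive. The following leads_vector: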

c(3, -3, 2, 5, 4, 3, 0, 2, -3, -1, -4, -5, -2, 0, 1)

...would have 4 lead changes (from 3 to -3, from -3 to 2, from 2 to -3, and from -2 to 0 to 1).

Any help with this is appreciated!

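EDIT - The longest run is the difficult stat to compute here, but I'm working on it.

EDIT2 - If I remove duplicate values from leads_vector, I think the longest run would be easier to compute. I cannot use the duplicated() function, though, because it also removes values that repeat at different places in the vector. Instead, I only want to remove duplicate values that are next to each other (to get c(0, -2, 0, 2, 4, 2, 5, 3, 5, 8, 10, 11, 9, 11, ...)).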
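One way to drop only adjacent repeats is base R's rle(), whose values component holds one entry per run of equal values. The sketch below contrasts it with duplicated() on a short made-up vector; the name x_toy is hypothetical.

```r
# Hypothetical toy vector with both adjacent and non-adjacent repeats
x_toy <- c(0, 0, -2, -2, 0, 2, 2, 4, 2, 2)

# rle() collapses each run of equal neighbours to a single value
collapsed <- rle(x_toy)$values
collapsed
#> [1]  0 -2  0  2  4  2

# duplicated() drops every later repeat, even non-adjacent ones,
# so the second 0 and the later 2 are lost
globally_unique <- x_toy[!duplicated(x_toy)]
globally_unique
#> [1]  0 -2  2  4
```

The collapsed vector keeps the order of score changes intact, which is what a run computation needs.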

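I figured out how to compute lead changes using the sign() and diff() functions. First, I need to filter out the values where the lead equals 0, because those are not lead changes by my count, even though R's sign() function returns distinct values for (+), (-) and 0. I have this: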

lead_changes <- sum(abs(diff(sign(leads_vector[leads_vector != 0])))) / 2
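As a sanity check, the count of 4 for the small example vector can be reproduced with a short sketch. It assumes a lead change is any sign flip between consecutive non-zero entries; the names example_leads and nonzero_signs are introduced here for illustration only.

```r
# The small example vector from the question
example_leads <- c(3, -3, 2, 5, 4, 3, 0, 2, -3, -1, -4, -5, -2, 0, 1)

# Drop ties, then reduce each lead to +1 (home ahead) or -1 (home behind)
nonzero_signs <- sign(example_leads[example_leads != 0])

# Each non-zero difference between neighbours is one sign flip
example_lead_changes <- sum(diff(nonzero_signs) != 0)
example_lead_changes
#> [1] 4
```

Counting non-zero differences avoids the cancellation that happens when the raw differences (+2 and -2) are summed directly.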

For the longest run, I think removing the adjacent duplicate values like this is a good start:

leads_vector[c(TRUE, leads_vector[-1] != leads_vector[-length(leads_vector)])]
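# biggest_lead: largest absolute margin held by either team
with(rle(leads_vector), max(abs(values)))

# number_ties: number of separate stretches where the score is level
# (this also counts the opening 0-0)
with(rle(leads_vector), sum(values == 0))

#longest_run

# lead_changes: number of same-signed runs among the non-zero leads, minus one
length(rle(leads_vector[leads_vector != 0] < 0)$values) - 1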

To compute the longest run:

compute_longest_run <- function(x) {
  # Collapse repetitions
  x_unique <- rle(x)$values

  # Compute score change
  score_change <- diff(x_unique)

  # Need to compute sum of all subvectors with the same sign
  run_side <- sign(score_change)
  run_id <- c(1, cumsum(diff(run_side) != 0) + 1)
  run_value <- tapply(score_change, run_id, sum)

  max(abs(run_value))
}

compute_longest_run(leads_vector)
#> [1] 10
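
To see what each intermediate step of this approach produces, here is a trace of the same steps on a small made-up vector. The vector trace_leads and its values are hypothetical; only base R functions are used.

```r
# Hypothetical toy vector (not real game data)
trace_leads <- c(0, 0, 2, 4, 4, 3, 1, 1, 0, 5)

# Step 1: collapse adjacent repeats  -> 0 2 4 3 1 0 5
x_unique <- rle(trace_leads)$values

# Step 2: change between consecutive scores -> 2 2 -1 -2 -1 5
score_change <- diff(x_unique)

# Step 3: label stretches of same-signed changes -> 1 1 2 2 2 3
run_side <- sign(score_change)
run_id <- c(1, cumsum(diff(run_side) != 0) + 1)

# Step 4: total swing within each stretch -> 4, -4, 5
run_value <- tapply(score_change, run_id, sum)

trace_longest_run <- max(abs(run_value))
trace_longest_run
#> [1] 5
```

Positive totals are swings toward the home team and negative totals are swings toward the away team; taking the absolute value reports the largest swing regardless of which side made it.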