Why does scipy minimize return such bad estimates?

I am trying to use scipy minimize to estimate the parameters of an ODE system. It is fairly straightforward, but the methods I'm using aren't returning values anywhere near what they should be. The estimate for my parameter beta should be around 0.42. I'm confident the method is correct, so I don't understand why the estimate is so bad.

import numpy as np
from scipy.integrate import odeint
from scipy.optimize import minimize
from scipy.optimize import minimize_scalar
import pandas as pd
from scipy.optimize import leastsq

t = np.linspace(0, 77, 77+1)
d = {'Week': [t[7],t[14],t[21],t[28],t[35],t[42],t[49],t[56],t[63],t[70],t[77]], 
     'incidence': [206.1705794,2813.420201,11827.9453,30497.58655,10757.66954,
                   7071.878779,3046.752723,1314.222882,765.9763902,201.3800578,109.8982006]}
df = pd.DataFrame(data=d)

def peak_infections(beta, df):
 
    # Weeks for which the ODE system will be solved
    #weeks = df.Week.to_numpy()

    # Total population, N.
    N = 100000
    # Initial number of infected and recovered individuals, I0 and R0.
    I0, R0 = 10, 0
    # Everyone else, S0, is susceptible to infection initially.
    S0 = N - I0 - R0
    J0 = I0
    # Contact rate, beta, and mean recovery rate, gamma, (in 1/days).
    #reproductive no. R zero is beta/gamma
    gamma = 1/6 #rate should be in weeks now
    # A grid of time points (in days, one point per week)
    times = np.arange(7, 84, 7)

    # The SIR model differential equations.
    def deriv(y, times, N, beta, gamma):
        S, I, R, J = y
        dS = ((-beta * S * I) / N)
        dI = ((beta * S * I) / N) - (gamma * I)
        dR = (gamma * I)
        dJ = ((beta * S * I) / N) #incidence
        return dS, dI, dR, dJ

    # Initial conditions are S0, I0, R0
    # Integrate the SIR equations over the time grid, t.
    solve = odeint(deriv, (S0, I0, R0, J0), times, args=(N, beta, gamma))
    S, I, R, J = solve.T

    return I/N

def residual(x, df):

    # Total population,  N.
    N = 100000
    incidence = df.incidence.to_numpy()/N
    return np.sum((peak_infections(x,df) - incidence) ** 2)

x0 = 0.5
res = minimize(residual, x0, args=(df,), method="Nelder-Mead", options={'fatol': 1e-04}).x
print(res)

best = leastsq(residual, x0, args=(df,))
print(best) #tried this using leastsq too

results = minimize_scalar(residual, bracket=(0.4, 0.5), args=(df,))
print(results)
results['fun']

As you can see, I have used minimize, minimize_scalar and even leastsq. They all return values like 0.723. Where am I going wrong? Is my objective function np.sum((peak_infections(x,df) - incidence) ** 2) even correct?
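For reference, leastsq is normally given the vector of residuals rather than their squared sum (it squares and sums internally). A minimal sketch of that form, reusing peak_infections, x0 and df from above:

import numpy as np
from scipy.optimize import leastsq

def residual_vector(beta, df):
    beta = np.ravel(beta)[0]   # leastsq passes the parameter as a length-1 array
    N = 100000
    incidence = df.incidence.to_numpy() / N
    # Per-week differences; leastsq squares and sums these internally
    return peak_infections(beta, df) - incidence

best, ier = leastsq(residual_vector, x0, args=(df,))
print(best)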

Edit: I tried using np.max(I/N) in the peak_infections(beta, df) function, but that doesn't return the correct estimate either.
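If the goal of that edit was to fit only the epidemic peak, one possible way to phrase such an objective is a hypothetical residual_peak that keeps peak_infections as defined above and takes the maximum outside:

def residual_peak(beta, df):
    # Hypothetical peak-only objective: compare the modelled maximum of I/N
    # to the largest observed (scaled) incidence value
    N = 100000
    return (np.max(peak_infections(beta, df)) - df.incidence.max() / N) ** 2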

Edit 2: the code used to test the real data against the beta estimate:

import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt
import pandas as pd

#three compartments, Susceptible S, infected I, recovered R
#dS/dt, dI/dt, dR/dt
#susceptible sees birth rate coming in, deaths leaving and force of infection leaving
#infected sees FOI coming in, deaths leaving and recovery rates
#recovered sees recovery rate coming in, deaths leaving
#beta is tranmission coefficient, FOI is beta * (I/N) where N is total pop
#initially consider a model not accounting for births and death

# Total population, N.
N = 100000
# Initial number of infected and recovered individuals, I0 and R0.
I0, R0 = 10, 0
# Everyone else, S0, is susceptible to infection initially.
S0 = N - I0 - R0
J0 = I0
# Contact rate, beta, and mean recovery rate, gamma, (in 1/days).
#reproductive no. R zero is beta/gamma
beta, gamma = 0.4205, 1/6
# A grid of time points (in days)
t = np.linspace(0, 77, 77+1)
t7 = np.arange(0, 84, 7)
t1 = 7 * np.arange(13)   # weekly time points in days: 0, 7, ..., 84

# The SIR model differential equations.
def deriv(y, t7, N, beta, gamma):
    S, I, R, J = y
    dS = ((-beta * S * I) / N)
    dI = ((beta * S * I) / N) - (gamma * I)
    dR = (gamma * I)
    dJ = ((beta * S * I) / N)
    return dS, dI, dR, dJ

# Initial conditions are S0, I0, R0
# Integrate the SIR equations over the time grid, t.
solve = odeint(deriv, (S0, I0, R0, J0), t7, args=(N, beta, gamma))
S, I, R, J = solve.T


d = {'Week': [t[0],t[7],t[14],t[21],t[28],t[35],t[42],t[49],t[56],t[63],t[70],t[77]],
     'incidence': [0,206.1705794,2813.420201,11827.9453,30497.58655,10757.66954,
                   7071.878779,3046.752723,1314.222882,765.9763902,201.3800578,109.8982006]}
df = pd.DataFrame(data=d)
df.plot(x='Week', y='incidence')


J_diff = np.diff(J)   # weekly new cases from the cumulative incidence J
fig = plt.figure(facecolor='w')
ax = fig.add_subplot(111, facecolor='#dddddd', axisbelow=True)
#ax.plot(t, J, 'red', alpha=1, lw=2, label='Cumulative incidence')
ax.plot(t7[1:], J_diff, 'blue', alpha=1, lw=2, label='Daily incidence')
ax.plot(t1[1:], df.incidence, 'r', alpha=1, lw=2, label='weekly data')
ax.set_xlabel('Time in days')
ax.set_ylabel('Number')
ax.grid(True, which='major', c='w', lw=2, ls='-')
legend = ax.legend()
legend.get_frame().set_alpha(0.5)

plt.show()

This returns the following fit for beta = 0.4205:

Assuming the model is fine, the problem is that you assume beta should be close to 0.42 when it actually shouldn't be. A simple visual test, plotting the measured and the modelled data, shows that 0.72 produces far better results than 0.42. I added the following lines:

import matplotlib.pyplot as plt
plt.plot(d['Week'], df.incidence.to_numpy()/100000, label="Real data")
plt.plot(d['Week'], peak_infections(.72, df), label="Model with 0.72")
plt.plot(d['Week'], peak_infections(.42, df), label="Model with .42")
plt.legend()

and obtained the following plot:

It is pretty clear that 0.72 is a much better estimate for beta than 0.42, even without plotting the residuals.
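To quantify this without a plot, it is enough to evaluate the question's own objective at both candidates, reusing residual and df from above:

for b in (0.42, 0.72):
    print(b, residual(b, df))   # sum of squared residuals; smaller is better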

As a side note, be careful, because your code can easily break. You define the same values in several places: N is defined in both peak_infections and residual, and the weeks are defined both as d['Week'] and as times inside peak_infections. It is easy to change one of these values in one place and forget to change it in the other. Also, I don't see the point of pandas here, since you don't take advantage of any of its features on top of numpy.
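A minimal sketch of that refactor, defining each shared value exactly once and using plain numpy arrays instead of pandas (same model and objective as in the question, nothing new assumed):

import numpy as np
from scipy.integrate import odeint
from scipy.optimize import minimize

# Shared constants, defined exactly once
N = 100000
gamma = 1 / 6
weeks = np.arange(7, 84, 7)                      # one time grid for data and solver
incidence = np.array([206.1705794, 2813.420201, 11827.9453, 30497.58655,
                      10757.66954, 7071.878779, 3046.752723, 1314.222882,
                      765.9763902, 201.3800578, 109.8982006]) / N

def deriv(y, t, N, beta, gamma):
    S, I, R, J = y
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    dJ = beta * S * I / N                        # cumulative incidence
    return dS, dI, dR, dJ

def model(beta, times):
    I0, R0 = 10, 0
    S0 = N - I0 - R0
    S, I, R, J = odeint(deriv, (S0, I0, R0, I0), times, args=(N, beta, gamma)).T
    return I / N

def residual(beta, times, incidence):
    beta = np.ravel(beta)[0]                     # minimize passes a length-1 array
    return np.sum((model(beta, times) - incidence) ** 2)

res = minimize(residual, 0.5, args=(weeks, incidence), method="Nelder-Mead")
print(res.x)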