Which algorithm to use for percentage features in my DV and IV, in regression?

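I am using regression to analyze server data and find feature importance.

Some of my IVs (independent variables), or Xs, are percentages, such as % time, % cores, % resource used, while others are plain numbers such as number of bytes.

I standardized all my Xs as (X-X_mean)/X_stddev. (Am I wrong in doing so?)

Which algorithm should I use in Python if my IVs are a mix of numeric values and %s, and I predict Y in the following cases: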

Case 1: Predict a continuous valued Y

a. Will using a Lasso regression suffice?

b. How do I interpret the X-coefficient if X is standardized and is a numeric value?

c. How do I interpret the X-coefficient if X is standardized and is a %?

Case 2: Predict a %-ed valued Y, like "% resource used".

a. Should I use Beta-Regression? If so which package in Python offers this?

b. How do I interpret the X-coefficient if X is standardized and is a numeric value?

c. How do I interpret the X-coefficient if X is standardized and is a %?

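If I am wrong in standardizing the Xs which are a % already, is it fine to use these numbers as 0.30 for 30% so that they fall within the range 0-1? So that means I do not standardize them, I will still standardize the other numeric IVs.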

Final Aim for both Cases 1 and 2:

To find the % of impact of IVs on Y. e.g.: When X1 increases by 1 unit, Y increases by 21%

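I understand from other posts that we can never add up all the coefficients to a total of 100 to assess the % of impact of each IV on the DV. I hope I am correct in this regard.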

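Having a mix of predictor types doesn't matter for any form of regression; it only changes how you interpret the coefficients. What does matter, however, is the type/distribution of your Y variable.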

Case 1: Predict a continuous valued Y a. Will using a Lasso regression suffice?

Regular OLS regression works fine for this.

b. How do I interpret the X-coefficient if X is standardized and is a numeric value?

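The interpretation of a coefficient always follows the format: "for a 1 unit change in X, we expect an X-coefficient amount of change in Y, holding the other predictors constant".

Because you have standardized X, your unit is a standard deviation. So the interpretation becomes "for a 1 standard deviation change in X, we expect an X-coefficient amount of change in Y...".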

c. How do I interpret the X-coefficient if X is standardized and is a %?

Same as above. Your unit is still a standard deviation, even though the variable originally came from a percentage.
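To make the "1 standard deviation" reading concrete, here is a minimal sketch with NumPy. The data are synthetic and the names (`cpu_pct`, `bytes_sent`, `ols`) are made up for illustration; nothing here comes from the asker's server data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical server metrics: one percentage feature and one byte count.
cpu_pct = rng.uniform(5, 95, n)                # percent, on a 0-100 scale
bytes_sent = rng.normal(20_000, 5_000, n)      # plain numeric feature
X = np.column_stack([cpu_pct, bytes_sent])
y = 3.0 + 0.4 * cpu_pct + 2e-4 * bytes_sent + rng.normal(0, 1, n)


def ols(features, target):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


sd = X.std(axis=0)
Z = (X - X.mean(axis=0)) / sd                  # (X - X_mean) / X_stddev

raw = ols(X, y)   # slopes: change in Y per 1 percentage point / per 1 byte
std = ols(Z, y)   # slopes: change in Y per 1 standard deviation of each X

# A standardized slope is the raw slope multiplied by that X's std. dev.
print(std[1:], raw[1:] * sd)
```

The two printed vectors agree, which is the whole point: standardizing does not change the fit, only the unit in which each slope is expressed, and that holds equally for the percentage column and the byte column.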

Case 2: Predict a %-ed valued Y, like % resource used.

a. Should I use Beta-Regression? If so which package in Python offers this?

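This is tricky. When your Y outcome is a percentage, the typical recommendation is to use something like binomial logistic regression, because a plain linear model can predict values below 0% or above 100%.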

b. How do I interpret the X-coefficient if X is standardized and is a numeric value?

c. How do I interpret the X-coefficient if X is standardized and is a %?

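Same interpretation as above. But if you use logistic regression, the coefficients are in units of log-odds. I recommend reading up on logistic regression to get a deeper understanding of how it works.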
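As a rough illustration of what "units of log-odds" means for a percentage outcome, the sketch below fits a binomial model with a logit link to proportions (often called a fractional logit) using a hand-written IRLS loop in NumPy. This is an assumption-laden toy, not a recommendation of a specific package: the data are noiseless and synthetic, and `fit_fractional_logit`, `x_std` and `true_beta` are hypothetical names. In practice a GLM routine from a statistics library would normally replace the loop.

```python
import numpy as np


def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))


def fit_fractional_logit(X, y, n_iter=50, tol=1e-10):
    """Binomial GLM with a logit link, fitted by iteratively reweighted
    least squares (IRLS). y holds proportions strictly between 0 and 1."""
    design = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        eta = design @ beta
        mu = sigmoid(eta)
        w = mu * (1.0 - mu)                    # binomial variance weights
        z = eta + (y - mu) / w                 # working response
        new_beta = np.linalg.solve(design.T @ (design * w[:, None]),
                                   design.T @ (w * z))
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


rng = np.random.default_rng(0)
x = rng.normal(size=300)
x_std = (x - x.mean()) / x.std()               # standardized predictor
true_beta = np.array([-1.0, 0.8])              # made-up intercept and slope
y_frac = sigmoid(true_beta[0] + true_beta[1] * x_std)  # "% used" as 0-1

beta = fit_fractional_logit(x_std[:, None], y_frac)

# The slope is in log-odds per 1 SD of X, so its effect on the percentage
# depends on the starting point.
p_at_mean = sigmoid(beta[0])                   # X at its mean
p_plus_1sd = sigmoid(beta[0] + beta[1])        # X one SD above its mean
print(beta, p_at_mean, p_plus_1sd)
```

With these made-up coefficients, moving X from its mean to one standard deviation above it takes the predicted Y from about 27% to about 45%. The same 0.8 log-odds step starting from a different baseline would give a different change in percentage points, which is why the coefficient cannot be read directly as "Y increases by so many percent".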

If I am wrong in standardizing the Xs which are a % already, is it fine to use these numbers as 0.30 for 30% so that they fall within the range 0-1? So that means I do not standardize them, I will still standardize the other numeric IVs.

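Standardizing is perfectly fine for variables in a regression, but like I said, it changes your interpretation, because your unit is now a standard deviation.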
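The 30 versus 0.30 question can be checked directly. The following sketch (synthetic data, hypothetical names `cpu_pct` and `fit_line`) fits the same simple OLS model with the percentage coded on a 0-100 scale and on a 0-1 scale.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
cpu_pct = rng.uniform(0, 100, n)                  # 30 means 30%
y = 10.0 + 0.25 * cpu_pct + rng.normal(0, 2, n)   # hypothetical response


def fit_line(x, target):
    """Simple OLS of target on x; returns (intercept, slope, fitted values)."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[0], coef[1], design @ coef


b0_pct, b1_pct, fit_pct = fit_line(cpu_pct, y)            # X coded 0-100
b0_frac, b1_frac, fit_frac = fit_line(cpu_pct / 100, y)   # X coded 0-1

# The slope for the 0-1 coding is 100 times the slope for the 0-100 coding;
# the intercept and the fitted values do not change.
print(b1_pct, b1_frac)
```

So for plain OLS the choice of coding is a matter of which unit you want the coefficient in: "per percentage point" (0-100), "per full 0-to-100% swing" (0-1), or "per standard deviation" (standardized).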

Final Aim for both cases 1 & 2:

To find the % of impact of IVs on Y. Eg: When X1 increases by 1 unit, Y increases by 21%

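If your Y is a percentage and you use something like OLS regression, then that is exactly how you would interpret the coefficients: for a 1 unit change in X1, Y changes by some number of percentage points.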

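Your question mixes up some concepts and jumbles a lot of terminology. Essentially you are asking about a) feature preprocessing for (linear) regression, b) the interpretability of linear regression coefficients, and c) sensitivity analysis (the effect of feature X_i on Y). But be careful, because you are making a huge assumption that Y depends linearly on each X_i; see below.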

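Standardization is not an "algorithm", just a technique for preprocessing data.
Regression generally wants standardized inputs (regularized variants such as Lasso depend on it, because the penalty is sensitive to feature scale), but tree-based algorithms (RF/XGB/GBT) do not need it; with those you can feed in the raw numeric features directly (percentages, totals, etc.).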
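(X-X_mean)/X_stddev is z-score scaling. Most references (scikit-learn's StandardScaler, for one) call this standardization, although some texts call it normalization, so it is safer to state the formula than to rely on the name.
- (The other common approach is min-max scaling, usually called normalization: (X-X_min)/(X_max-X_min), which transforms each variable into the range [0,1].)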
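The two rescalings mentioned above, written out with NumPy on a handful of made-up percentage values (`cpu_pct` is a hypothetical feature). The names "z-score" and "min-max" are used here to sidestep the terminology question.

```python
import numpy as np

cpu_pct = np.array([12.0, 30.0, 45.0, 58.0, 91.0])   # made-up % values

# z-score scaling: (X - X_mean) / X_stddev
# np.std defaults to the population standard deviation (ddof=0).
z_score = (cpu_pct - cpu_pct.mean()) / cpu_pct.std()

# min-max scaling: (X - X_min) / (X_max - X_min)
min_max = (cpu_pct - cpu_pct.min()) / (cpu_pct.max() - cpu_pct.min())

print(z_score.mean(), z_score.std())   # mean ~0, standard deviation ~1
print(min_max.min(), min_max.max())    # bounded to [0, 1]
```

Note that simply dividing a percentage by 100 is a third thing again: it also lands in [0,1], but it keeps the natural endpoints 0% and 100% rather than mapping the observed minimum and maximum to 0 and 1.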
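Finally, you ask about sensitivity analysis in regression: can we directly interpret the regression coefficient for X_i as the sensitivity of Y on X_i?
- Stop and think about what you are assuming in "Final Aim for both cases 1 & 2: To find the % of impact of IVs on Y. Eg: When X1 increases by 1 unit, Y increases by 21%".
- You are assuming the dependent variable has a linear relationship with each independent variable. But that is often not the case; it may be nonlinear. For example, if you look at the effect of Age on Salary, you would typically see it rise up to the 40s/50s, then gradually tail off, and then drop sharply once you hit retirement age (say 65).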
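- So you would model the effect of Age on Salary as a quadratic or higher-order polynomial by adding Age^2 and Age^3 terms (elsewhere you may see terms such as sqrt(X), log(X), log1p(X), exp(X), etc.; whatever best captures the nonlinear relationship. You may also see variable-variable interaction terms, even though regression assumes the predictors are not collinear and strongly correlated predictors make the individual coefficients hard to interpret.)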
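- Clearly Age has a large effect on Salary, but we would not measure the sensitivity of Salary to Age by combining the (absolute values of the) coefficients of Age, Age^2 and Age^3.
- If we only had a linear term for Age, the single coefficient for Age would massively understate the effect of Age on Salary, because it would "average out" the Age<40 regime against the Age>50 regime.
So the general answer to "Can we directly interpret the regression coefficient for X_i as the sensitivity of Y on X_i?" is "Only if the relationship between Y and that X_i is linear, otherwise no".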
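The "averaging out" effect is easy to reproduce. The sketch below uses a deliberately simplified, made-up salary curve (a symmetric inverted U peaking at 45, without the sharp retirement drop) and compares a linear-only fit with one that includes an Age^2 term. All numbers and names are illustrative assumptions.

```python
import numpy as np

age = np.linspace(25, 65, 41)                 # ages 25..65, symmetric about 45
salary = 80.0 - 0.05 * (age - 45.0) ** 2      # made-up curve, in thousands
age_c = age - age.mean()                      # centred Age keeps terms well scaled


def fit(design, target):
    """OLS via least squares; returns (coefficients, R^2)."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return coef, r2


ones = np.ones_like(age_c)
lin_coef, lin_r2 = fit(np.column_stack([ones, age_c]), salary)
quad_coef, quad_r2 = fit(np.column_stack([ones, age_c, age_c ** 2]), salary)

print("linear-only slope:", lin_coef[1], "R^2:", lin_r2)
print("quadratic coefs:  ", quad_coef, "R^2:", quad_r2)
```

In this toy setup the linear-only slope comes out at essentially zero even though salary is completely determined by age, because the rising and falling halves cancel. Adding the squared term recovers the curve, but then no single coefficient summarizes "the effect of Age".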
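In general, a better and much easier way to do sensitivity analysis (one that does not assume a linear response or require standardizing % features) is to use tree-based algorithms (RF/XGB/GBT), which produce feature importances.
- As an aside, I understand your exercise tells you to use regression, but in general you will get feature-importance information faster from tree-based methods (RF/XGB), especially with shallow trees (a small value of max_depth, a large value of nodesize, e.g. >0.1% of the training-set size). That is why people use them, even when their ultimate goal is regression.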
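A minimal sketch of the tree-based route, assuming scikit-learn is available. The data are synthetic and the feature names are hypothetical; `min_samples_leaf` is used as scikit-learn's rough counterpart of the `nodesize` setting mentioned above. The features are passed in raw, with no standardization.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000

# Raw features on very different scales, deliberately left unscaled.
cpu_pct = rng.uniform(0, 100, n)                       # percentage
bytes_sent = rng.lognormal(mean=10, sigma=1, size=n)   # large counts
unrelated = rng.normal(size=n)                         # has no effect on y
X = np.column_stack([cpu_pct, bytes_sent, unrelated])

# Made-up response: inverted U in cpu_pct plus a log effect of bytes_sent.
y = (50 - 0.02 * (cpu_pct - 50) ** 2
     + 5 * np.log(bytes_sent)
     + rng.normal(0, 1, n))

forest = RandomForestRegressor(
    n_estimators=100,
    max_depth=4,            # shallow trees
    min_samples_leaf=20,    # 1% of the training set per leaf
    random_state=0,
)
forest.fit(X, y)

importances = forest.feature_importances_       # impurity-based, sums to 1
linear_corr = np.corrcoef(cpu_pct, y)[0, 1]     # what a linear view would see

for name, imp in zip(["cpu_pct", "bytes_sent", "unrelated"], importances):
    print(f"{name:10s} {imp:.3f}")
print("linear correlation of cpu_pct with y:", round(linear_corr, 3))
```

In this setup the forest ranks `cpu_pct` first even though its linear correlation with `y` is close to zero, which is the case where a single regression coefficient would be misleading. Keep in mind that these importances are relative shares of impurity reduction, not "Y changes by 21% per unit of X" statements.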

(Your question would get better answers over at CrossValidated, but it is fine to leave it here too; there is some crossover.)


python

statistics

regression

feature-extraction

percentage