Linear model: long or wide data frame?

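I am trying to understand multiple linear regression in R.

I have a data frame that looks like this. There is a Source_Group column identifying the marketing channel and a Spend column showing the money spent.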

       Date Source_Group    Spend Total_Orders year month
1 2021-01-01          OTT 12359.16           28 2021     1
2 2021-01-01  Paid Search 17266.55          190 2021     1
3 2021-01-01  Paid Social  6799.28           40 2021     1
4 2021-01-01      YouTube     0.00            7 2021     1
5 2021-01-02          OTT  9104.31           42 2021     1

Here is the dput output to recreate part of the first data frame:

structure(list(Date = structure(c(18628, 18628, 18628, 18628, 
18629), class = "Date"), Source_Group = structure(c(11L, 12L, 
13L, 17L, 11L), .Label = c("Article Or Blog", "Audio", "Direct", 
"Email", "From A Friend", "From Contacts", "Influencer", "Organic Search", 
"Organic Social", "Other", "OTT", "Paid Search", "Paid Social", 
"Pepperjam", "Podcast", "Reddit", "YouTube", "Organic", "Peoplehype"
), class = "factor"), Spend = c(12359.16, 17266.55, 6799.28, 
0, 9104.31), Total_Orders = c(28, 190, 40, 7, 42), year = c(2021, 
2021, 2021, 2021, 2021), month = structure(c(1L, 1L, 1L, 1L, 
1L), .Label = c("1", "2", "3", "4"), class = "factor")), row.names = c(NA, 
-5L), groups = structure(list(Date = structure(c(18628, 18629
), class = "Date"), .rows = structure(list(1:4, 5L), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

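I want to see how many orders result from the different amounts spent on each marketing channel, and use that to make some decisions about how to allocate resources.

With that data frame, do I create a linear model like this: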

linear_model_long_format <- lm(Total_Orders ~ Spend + Source_Group, df)

Or should I reshape the data frame into wide format with the following code:

library(tidyr)  # provides pivot_wider()
df_wide <- pivot_wider(df, names_from = Source_Group, values_from = Spend)

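My data frame would then be in wide format, with one Spend column per channel. Here is some dput output to recreate the second data frame: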

structure(list(Date = structure(c(18628, 18628, 18628, 18628, 
18629), class = "Date"), Total_Orders = c(28, 190, 40, 7, 42), 
    year = c(2021, 2021, 2021, 2021, 2021), month = structure(c(1L, 
    1L, 1L, 1L, 1L), .Label = c("1", "2", "3", "4"), class = "factor"), 
    OTT = c(12359.16, 0, 0, 0, 9104.31), `Paid Search` = c(0, 
    17266.55, 0, 0, 0), `Paid Social` = c(0, 0, 6799.28, 0, 0
    ), YouTube = c(0, 0, 0, 0, 0)), row.names = c(NA, -5L), groups = structure(list(
    Date = structure(c(18628, 18629), class = "Date"), .rows = structure(list(
        1:4, 5L), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))


df_wide$OTT[is.na(df_wide$OTT)] <- 0
df_wide$`Paid Search`[is.na(df_wide$`Paid Search`)] <- 0
df_wide$`Paid Social`[is.na(df_wide$`Paid Social`)] <- 0
df_wide$YouTube[is.na(df_wide$YouTube)] <- 0

I noticed that I had to set the NA values to 0 to avoid errors.

I think my linear model would then be:

linear_model_wide_format <- lm(Total_Orders ~ OTT + `Paid Search` + `Paid Social` + YouTube, df_wide)

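The posts I have seen online seem to use this wider format for linear models, where every column is a variable. At the same time, I know that long format is usually preferred in R, and those 0s make me doubt that the wide format is the way to go. I'm really not sure.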

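The long format is almost certainly better. If you fit the model in long format, R uses a contrast matrix to convert the factor variable into a set of binary (dummy) variables; this is a little confusing, but it lets you make a variety of comparisons among groups.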
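To make the dummy-variable conversion concrete, here is a minimal sketch that prints the contrast matrix and the design matrix R builds for the long-format formula. The `toy` data frame and its values are invented for illustration and are not the asker's data; the sketch also assumes R's default treatment contrasts are in effect.

```r
# Toy long-format data; the values are made up for illustration only.
toy <- data.frame(
  Spend = c(100, 200, 150, 0, 120, 180),
  Source_Group = factor(c("OTT", "Paid Search", "Paid Social",
                          "YouTube", "OTT", "Paid Search")),
  Total_Orders = c(3, 20, 5, 1, 4, 18)
)

# Contrast matrix for the factor: one row per level, one column per
# non-reference level (treatment contrasts are the default for
# unordered factors).
contrasts(toy$Source_Group)

# Design matrix that lm() would build from the formula: an intercept,
# Spend, and one 0/1 column for each level except the reference level
# ("OTT", the first level here).
X <- model.matrix(Total_Orders ~ Spend + Source_Group, data = toy)
colnames(X)
head(X)
```

Each `Source_Group...` column is 1 for rows in that channel and 0 otherwise, so its coefficient is the expected difference in orders relative to the reference channel at the same spend.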

Calling equatiomatic::extract_eq() on the fitted model gives the model equation.
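The rendered equation is not reproduced in this text. As a rough sketch of its general shape for Total_Orders ~ Spend + Source_Group (an assumption based on the formula and on the factor levels in the dput output, where "Article Or Blog" is the first level and therefore the reference under default treatment contrasts), it would look something like this, with one dummy term per remaining level:

```latex
\text{Total\_Orders} = \alpha
  + \beta_{1}\,(\text{Spend})
  + \beta_{2}\,(\text{Source\_Group}_{\text{Audio}})
  + \beta_{3}\,(\text{Source\_Group}_{\text{Direct}})
  + \cdots
  + \epsilon
```

Here \beta_1 is the expected change in total orders per unit increase in spend, shared by all source groups, and each remaining \beta shifts the intercept for one source group relative to the reference.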

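You may also want to try the interaction model Total_Orders ~ Spend*Source_Group, which lets you compare how the effect of spend on total orders differs across source groups. In other words: how does the expected change in total orders per unit increase in spend (the beta_1 parameter above) differ between source groups?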
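A minimal sketch of that comparison follows, using simulated data rather than the asker's: the `toy` data frame, the per-group slopes, and the noise level are all assumptions made up for illustration. It fits the additive and interaction models, tests whether the interaction improves the fit, and recovers a per-group slope from the interaction coefficients.

```r
set.seed(1)

# Simulated long-format data with a different (made-up) Spend slope per group.
toy <- data.frame(
  Source_Group = factor(rep(c("OTT", "Paid Search", "Paid Social"), each = 20)),
  Spend = runif(60, min = 0, max = 20000)
)
true_slope <- c("OTT" = 0.002, "Paid Search" = 0.010, "Paid Social" = 0.005)
toy$Total_Orders <- 5 +
  unname(true_slope[as.character(toy$Source_Group)]) * toy$Spend +
  rnorm(60, sd = 5)

# Additive model: one common Spend slope, group-specific intercepts.
fit_additive <- lm(Total_Orders ~ Spend + Source_Group, data = toy)

# Interaction model: group-specific intercepts and group-specific Spend slopes.
fit_interaction <- lm(Total_Orders ~ Spend * Source_Group, data = toy)

# F-test of whether allowing the slopes to differ improves the fit.
anova(fit_additive, fit_interaction)

# Per-group slope = reference-group slope + that group's interaction term.
b <- coef(fit_interaction)
slopes <- c(
  "OTT"         = unname(b["Spend"]),
  "Paid Search" = unname(b["Spend"] + b["Spend:Source_GroupPaid Search"]),
  "Paid Social" = unname(b["Spend"] + b["Spend:Source_GroupPaid Social"])
)
slopes
```

Each entry of `slopes` estimates the extra orders per additional unit of spend in that channel, which is the quantity relevant to a budget-allocation decision.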

I pasted the extract_eq() result into https://quicklatex.com/ to get the rendered LaTeX image.