Linear model: long or wide data frame?
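I'm trying to understand multiple linear regression in R.
I have a data frame that looks like this. You can see there is a Source_Group category with information on the different channels, and a Spend column showing the money spent.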
Date Source_Group Spend Total_Orders year month
1 2021-01-01 OTT 12359.16 28 2021 1
2 2021-01-01 Paid Search 17266.55 190 2021 1
3 2021-01-01 Paid Social 6799.28 40 2021 1
4 2021-01-01 YouTube 0.00 7 2021 1
5 2021-01-02 OTT 9104.31 42 2021 1
Here is dput output to recreate part of the first data frame:
structure(list(Date = structure(c(18628, 18628, 18628, 18628,
18629), class = "Date"), Source_Group = structure(c(11L, 12L,
13L, 17L, 11L), .Label = c("Article Or Blog", "Audio", "Direct",
"Email", "From A Friend", "From Contacts", "Influencer", "Organic Search",
"Organic Social", "Other", "OTT", "Paid Search", "Paid Social",
"Pepperjam", "Podcast", "Reddit", "YouTube", "Organic", "Peoplehype"
), class = "factor"), Spend = c(12359.16, 17266.55, 6799.28,
0, 9104.31), Total_Orders = c(28, 190, 40, 7, 42), year = c(2021,
2021, 2021, 2021, 2021), month = structure(c(1L, 1L, 1L, 1L,
1L), .Label = c("1", "2", "3", "4"), class = "factor")), row.names = c(NA,
-5L), groups = structure(list(Date = structure(c(18628, 18629
), class = "Date"), .rows = structure(list(1:4, 5L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
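I want to see how many orders result from the different amounts spent on the different marketing channels, and make some decisions about how to allocate resources.
Using that data frame, should I create a linear model like this: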
linear_model_long_format <- lm(Total_Orders ~ Spend + Source_Group, df)
Or should I restructure the data frame into wide format with the following code:
df_wide <- pivot_wider(df, names_from = Source_Group, values_from = Spend)
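My data frame would then be in wide format, with one Spend column per channel. Here is some dput output to recreate the second data frame: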
structure(list(Date = structure(c(18628, 18628, 18628, 18628,
18629), class = "Date"), Total_Orders = c(28, 190, 40, 7, 42),
year = c(2021, 2021, 2021, 2021, 2021), month = structure(c(1L,
1L, 1L, 1L, 1L), .Label = c("1", "2", "3", "4"), class = "factor"),
OTT = c(12359.16, 0, 0, 0, 9104.31), `Paid Search` = c(0,
17266.55, 0, 0, 0), `Paid Social` = c(0, 0, 6799.28, 0, 0
), YouTube = c(0, 0, 0, 0, 0)), row.names = c(NA, -5L), groups = structure(list(
Date = structure(c(18628, 18629), class = "Date"), .rows = structure(list(
1:4, 5L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
df_wide$OTT[is.na(df_wide$OTT)] <- 0
df_wide$`Paid Search`[is.na(df_wide$`Paid Search`)] <- 0
df_wide$`Paid Social`[is.na(df_wide$`Paid Social`)] <- 0
df_wide$YouTube[is.na(df_wide$YouTube)] <- 0
I noticed that I had to set the NA values to 0 to avoid errors.
I think my linear model would then be:
linear_model_wide_format <- lm(Total_Orders ~ OTT + `Paid Search` + `Paid Social` + YouTube, df_wide)
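The posts I've seen online seem to use this wider format for linear models, where each column is a variable, but I also know that long format is generally preferred in R, and those 0s make me doubt that wide format is the way to go. I'm really not sure.
Long format is almost certainly better. If you fit the model in long format, R will use a contrast matrix to convert the factor variable into a set of binary (dummy) variables; this is a bit confusing at first, but it lets you make all kinds of comparisons between groups.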
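To make the dummy-variable conversion concrete, here is a minimal base-R sketch. It uses only the five sample rows shown in the question and, as a simplifying assumption, a factor with just the four channels that appear there (the real Source_Group factor has 19 levels, so the real design matrix would have more columns).

```r
# The five sample rows from the question, with a 4-level factor
# (assumption: only the channels that actually appear in these rows)
df <- data.frame(
  Source_Group = factor(c("OTT", "Paid Search", "Paid Social", "YouTube", "OTT")),
  Spend        = c(12359.16, 17266.55, 6799.28, 0, 9104.31),
  Total_Orders = c(28, 190, 40, 7, 42)
)

# Default treatment contrasts: the first level ("OTT") is the baseline,
# and every other level gets its own 0/1 indicator column
contrasts(df$Source_Group)

# The design matrix that lm() builds internally for the long-format formula
X <- model.matrix(Total_Orders ~ Spend + Source_Group, data = df)
colnames(X)
```

Each non-baseline channel shows up as one indicator column, so its coefficient is read as a shift in expected orders relative to OTT at the same spend.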
Using equatiomatic::extract_eq() on the fitted model, we get its equation, in which beta_1 is the coefficient on Spend.
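The rendered equation did not survive here. As a reconstruction rather than the original output, and assuming only the four channels from the sample rows with OTT as the baseline level, the additive model has roughly this form (with all 19 factor levels there would be one dummy term per non-baseline level):

```latex
\operatorname{Total\_Orders} = \alpha + \beta_{1}(\operatorname{Spend})
  + \beta_{2}(\operatorname{Source\_Group}_{\operatorname{Paid\ Search}})
  + \beta_{3}(\operatorname{Source\_Group}_{\operatorname{Paid\ Social}})
  + \beta_{4}(\operatorname{Source\_Group}_{\operatorname{YouTube}})
  + \epsilon
```

Here beta_1 is a single slope for Spend shared by every channel, while beta_2 to beta_4 only shift the intercept for each channel relative to the baseline.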
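You might also want to try the interaction model Total_Orders ~ Spend*Source_Group, which lets you compare how the effect of spend on total orders differs across source groups, i.e. how does the expected change in total orders per unit increase in spend (the beta_1 parameter above) differ between source groups?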
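A minimal sketch of that comparison follows. The data are simulated purely for illustration (the channel slopes in true_slope are made up, not estimated from the question's data); the point is only to show how the interaction terms are read.

```r
set.seed(1)  # simulated data, purely illustrative

channels   <- c("OTT", "Paid Search", "Paid Social")
# Hypothetical "true" orders-per-dollar slopes for each channel
true_slope <- c("OTT" = 0.002, "Paid Search" = 0.010, "Paid Social" = 0.005)

sim <- data.frame(
  Source_Group = factor(rep(channels, each = 30)),
  Spend        = runif(90, min = 0, max = 20000)
)
sim$Total_Orders <- 10 +
  unname(true_slope[as.character(sim$Source_Group)]) * sim$Spend +
  rnorm(90, sd = 5)

# Additive model: one common Spend slope, channel-specific intercepts
fit_add <- lm(Total_Orders ~ Spend + Source_Group, data = sim)

# Interaction model: channel-specific intercepts AND Spend slopes
fit_int <- lm(Total_Orders ~ Spend * Source_Group, data = sim)

# "Spend" is the slope for the baseline channel (OTT); each
# "Spend:Source_Group..." term is the difference from that baseline slope
b <- coef(fit_int)
slope_paid_search <- unname(b["Spend"] + b["Spend:Source_GroupPaid Search"])

# F-test of whether the channel-specific slopes improve on a common slope
cmp <- anova(fit_add, fit_int)
```

The per-channel slope is the baseline Spend coefficient plus that channel's interaction term, which is the quantity you would compare when deciding where an extra dollar goes.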
I pasted the extract_eq() result into https://quicklatex.com/ to render the LaTeX.