Pulling an example data point from a linear regression model
I recently created a linear regression model in RStudio, as shown below:
> model1 = lm(price~sqft_living,train)
> pred_train = predict(model1)
> rmse_train = sqrt(mean((pred_train - train$price)^2))
> rmse_train
[1] 261068.9
> pred_test = predict(model1,newdata=test)
> rmse_test = sqrt(mean((pred_test - test$price)^2))
> rmse_test
[1] 262334.4
> sse = sum((pred_train - train$price)^2)
> sst = sum((mean(train$price)-train$price)^2)
> r2 = 1 - sse/sst
> r2
[1] 0.4967993
> summary(model1)
Call:
lm(formula = price ~ sqft_living, data = train)
Residuals:
Min 1Q Median 3Q Max
-1491759 -146386 -24131 106578 4348558
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -47764.278 5250.938 -9.096 <0.0000000000000002 ***
sqft_living 282.092 2.305 122.381 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 261100 on 15170 degrees of freedom
Multiple R-squared: 0.4968, Adjusted R-squared: 0.4968
F-statistic: 1.498e+04 on 1 and 15170 DF, p-value: < 0.00000000000000022
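My question is that I need to work out: "According to model1, on average, how much would a 1400 square foot house cost?"
This may sound a bit silly, but I don't know how to get this from my model, and searching online hasn't turned up an answer. Any help would be greatly appreciated.
Below is some code showing the dataset: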
> dput(head(houses))
structure(list(id = c(7129300520, 6414100192, 5631500400, 2487200875,
1954400510, 7237550310), price = c(221900, 538000, 180000, 604000,
510000, 1225000), bedrooms = c(3, 3, 2, 4, 3, 4), bathrooms = c(1,
2.25, 1, 3, 2, 4.5), sqft_living = c(1180, 2570, 770, 1960, 1680,
5420), sqft_lot = c(5650, 7242, 10000, 5000, 8080, 101930), floors = c(1,
2, 1, 1, 1, 1), waterfront = c(0, 0, 0, 0, 0, 0), view = c(0,
0, 0, 0, 0, 0), condition = c(3, 3, 3, 5, 3, 3), grade = c(7,
7, 6, 7, 8, 11), sqft_above = c(1180, 2170, 770, 1050, 1680,
3890), sqft_basement = c(0, 400, 0, 910, 0, 1530), yr_built = c(1955,
1951, 1933, 1965, 1987, 2001), yr_renovated = c(0, 1991, 0, 0,
0, 0), age = c(59, 63, 82, 49, 28, 13)), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
> glimpse(houses)
Rows: 21,613
Columns: 16
$ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 195440051…
$ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, …
$ bedrooms <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, …
$ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.5…
$ sqft_living <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890…
$ sqft_lot <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, …
$ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.…
$ waterfront <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ view <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, …
$ condition <dbl> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4, …
$ grade <dbl> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7,…
$ sqft_above <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 1890…
$ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, 0…
$ yr_built <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 200…
$ yr_renovated <dbl> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ age <dbl> 59, 63, 82, 49, 28, 13, 19, 52, 55, 12, 50, 72, 87, 37, 1…
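To predict the response for a new value of the regressor, create a new data set and pass it to predict. The objects returned by R's modelling functions are S3 class objects, so there are likely to be methods written for them, in this case predict. Note that the output below comes from a model fitted to the six rows posted in the question with dput(head(houses)), so the numbers differ from what model1, fitted on train, would give.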
model <- lm(price ~ sqft_living, houses)
new <- data.frame(sqft_living = 1400)
predict(model, newdata = new)
# 1
#357469.5
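The prediction above can be reproduced from the six rows posted in the question. The following is a minimal sketch, assuming only the price and sqft_living columns of that sample are needed; the names by_hand and ci are introduced here for illustration. It also shows that predict is just the fitted line evaluated at the new value, and that interval = "confidence" returns bounds for the mean price, which is what "on average" asks about.

```r
# The six rows posted in the question via dput(head(houses)),
# restricted to the two columns the model uses.
houses <- data.frame(
  price       = c(221900, 538000, 180000, 604000, 510000, 1225000),
  sqft_living = c(1180, 2570, 770, 1960, 1680, 5420)
)

model <- lm(price ~ sqft_living, data = houses)
new <- data.frame(sqft_living = 1400)

# Point prediction from predict() ...
pred <- predict(model, newdata = new)

# ... equals the fitted line evaluated at 1400: intercept + slope * 1400.
by_hand <- unname(coef(model)[1] + coef(model)[2] * 1400)

# interval = "confidence" adds lower and upper bounds for the mean
# price of houses of this size (columns fit, lwr, upr).
ci <- predict(model, newdata = new, interval = "confidence")

print(pred)
print(by_hand)
print(ci)
```

With only six observations the interval is wide; on the full training set the same calls apply with model1 in place of model.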
As for the RMSE in the question, the following is simpler:
rmse <- function(object){
e <- resid(object)
sqrt(mean(e^2, na.rm = TRUE))
}
rmse(model)
#[1] 80374.95
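The helper above uses the residuals stored in the model, so it only gives the training RMSE. A sketch of a variant that also works on held-out data, such as the test set in the question, might look like the following; rmse_on and its response argument are hypothetical names, not part of base R, and the six-row sample stands in for real train and test sets.

```r
# The six rows posted in the question via dput(head(houses)).
houses <- data.frame(
  price       = c(221900, 538000, 180000, 604000, 510000, 1225000),
  sqft_living = c(1180, 2570, 770, 1960, 1680, 5420)
)
model <- lm(price ~ sqft_living, data = houses)

# Hypothetical helper: RMSE of `model` on any data frame that
# contains the response column and the regressors.
rmse_on <- function(model, data, response = "price") {
  err <- data[[response]] - predict(model, newdata = data)
  sqrt(mean(err^2, na.rm = TRUE))
}

# On the data the model was fitted to, this matches the
# residual-based version sqrt(mean(resid(model)^2)).
print(rmse_on(model, houses))
```

With the objects from the question, rmse_on(model1, train) and rmse_on(model1, test) would replace the two manual computations.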
As for the follow-up question in the comments,
Based on model1, if a homeowner were to put in a 200 square foot addition on the house, how much would the price be expected to go up by?
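The answer is straightforward: the model coefficient of the sqft_living term is the expected change in price that a 1-unit increase in the regressor produces on average.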
coef(model)
#(Intercept) sqft_living
# 50960.6653 218.9349
coef(model)[2] * 200
#sqft_living
# 43786.98
The same result is obtained by computing the predicted prices for 2 values of sqft_living that are 200 units apart.
new2 <- data.frame(sqft_living = c(1400, 1400 + 200))
ypred <- predict(model, newdata = new2)
diff(ypred)
# 2
#43786.98
The same value as above.
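Both questions are phrased in terms of model1, whose coefficients are printed in the summary(model1) output in the question. As a rough check, the same arithmetic can be done directly with those printed values; this sketch assumes the rounded estimates are accurate enough, and b0, b1, price_1400 and increase_200 are names introduced here for illustration.

```r
# Coefficients as printed by summary(model1) in the question (rounded).
b0 <- -47764.278   # (Intercept)
b1 <- 282.092      # sqft_living

# Expected price of a 1400 square foot house under model1.
price_1400 <- b0 + b1 * 1400

# Expected price change for a 200 square foot addition: slope * 200.
increase_200 <- b1 * 200

print(price_1400)    # about 347164.5
print(increase_200)  # 56418.4
```

With the train data available, predict(model1, newdata = data.frame(sqft_living = 1400)) and coef(model1)[2] * 200 give the unrounded versions of these numbers.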