Create a column in R in a large database
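I apologize if this question has already been answered, but I have not found it. I will post everything I have tried. The problem is that the database is large and my PC (Core i7, 8 GB RAM) cannot perform this calculation. I am using Microsoft R Open 3.3.2 and RStudio 1.0.136.
I am trying to create a new column in a large dataset in R called tcm.RData (471 MB). I need a column that divides Shape_Area by the sum of Shape_Area grouped by COD (I call that sum ShapeSum). I first tried to do it in a single expression, but that failed, so I tried again in two steps: 1) sum Shape_Area by COD and, if that works, 2) divide Shape_Area by ShapeSum.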
> str(tcm)
Classes ‘data.table’ and 'data.frame': 26835293 obs. of 15 variables:
$ OBJECTID : int 1 2 3 4 5 6 7 8 9 10 ...
$ LAT : num -15.7 -15.7 -15.7 -15.7 -15.7 ...
$ LONG : num -58.1 -58.1 -58.1 -58.1 -58.1 ...
$ UF : chr "MT" "MT" "MT" "MT" ...
$ COD : num 510562 510562 510562 510562 510562 ...
$ AREA_97 : num 1130 1130 1130 1130 1130 ...
$ Shape_Area: num 255266.7 14875 25182.2 5503.9 95.5 ...
$ TYPE : chr "2" "2" "2" "2" ...
$ Nomes : chr NA NA NA NA ...
$ NEAR_DIST : num 376104 371332 371410 371592 371330 ...
$ tc_2004 : chr "AREA_URBANA" "DESFLORESTAMENTO_2004" "DESFLORESTAMENTO_2004" "DESFLORESTAMENTO_2004" ...
$ tc_2008 : chr "AREA_URBANA" "AREA_NAO_OBSERVADA" "AREA_NAO_OBSERVADA" "AREA_NAO_OBSERVADA" ...
$ tc_2010 : chr "AREA_URBANA" "PASTO_LIMPO" "PASTO_LIMPO" "PASTO_LIMPO" ...
$ tc_2012 : chr "AREA_URBANA" "PASTO_SUJO" "PASTO_SUJO" "PASTO_SUJO" ...
$ tc_2014 : chr "AREA_URBANA" "PASTO_LIMPO" "PASTO_LIMPO" "PASTO_SUJO" ...
- attr(*, ".internal.selfref")=<externalptr>
> tcm$ShapeSum <- tcm[, Shape_Area := sum(tcm$Shape_Area), by="COD"]
Error: cannot allocate vector of size 204.7 Mb
Error during wrapup: cannot allocate vector of size 542.3 Mb
I also tried the following code, but every attempt failed:
> tcm$ShapeSum <- apply(tcm[, c(Shape_Area)], 1, function(x) sum(x), by="COD")
Error in apply(tcm[, c(Shape_Area)], 1, function(x) sum(x), by = "COD") :
dim(X) must have a positive lenght
> tcm$ShapeSum <- mutate(tcm, ShapeSum = sum(Shape_Area), by="COD", package = "dplyr")
Error: cannot allocate vector of size 204.7 Mb
Error during wrapup: cannot allocate vector of size 542.3 Mb
> tcm$ShapeSum <- tcm[, transform(tcm, ShapeSum = sum(Shape_Area)), by="COD"]
> tcm$ShapeSum <- transform(tcm, aggregate(tcm$AreaShape, by=list(Category=tcm$COD), FUN=sum))
Error in aggregate.data.frame(as.data.frame(x), ...): no rows to aggregate
Thank you very much for your attention and for any suggestions on how to solve this.
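library(data.table)
# fread() reads a delimited text file straight into a data.table
tcm <- fread("your_tcm_file.txt")
# := adds the column by reference, without copying the table
tcm[, newColumn := oldColumnPlusOne + 1]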
More:
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
We can use the data.table approach to create the column; it is more efficient because assignment with := happens in place.
library(data.table)
tcm[, ShapeSum := sum(Shape_Area), by = COD]
Or, as @user20650 suggested (based on the OP's description):
tcm[, ShapeSum := Shape_Area/sum(Shape_Area), by = COD]
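To make the two steps from the question concrete, here is a minimal sketch on a toy table. The table `toy`, its values, and the column name `ShapeShare` are made up for illustration and are not part of the original data. Note that inside `[ ]` the column is referred to as `Shape_Area`, not `tcm$Shape_Area`: the latter always returns the whole column, so `by` would have no effect on the sum. The result is also not assigned back with `tcm$ShapeSum <-`, because `:=` already modifies the table by reference.

```r
library(data.table)

# Toy stand-in for tcm: two COD groups with made-up areas (hypothetical values).
toy <- data.table(
  COD        = c(1, 1, 1, 2, 2),
  Shape_Area = c(10, 30, 60, 5, 15)
)

# Step 1: per-COD total, added by reference (the table is not copied).
toy[, ShapeSum := sum(Shape_Area), by = COD]

# Step 2: each row's share of its COD total (ShapeShare is a hypothetical name).
toy[, ShapeShare := Shape_Area / ShapeSum]

print(toy)
```

Within each COD group the shares should add up to 1, which is a quick sanity check to run on the real table as well.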