Create column in R in a large database

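I'm sorry if this question has already been answered, but I haven't found it. I'll post everything I have tried so far. The problem is that the database is large and my PC (Core i7, 8 GB RAM) cannot perform the calculation. I'm using Microsoft R Open 3.3.2 and RStudio 1.0.136.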

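I'm trying to create a new column in a large database in R, stored as tcm.RData (471 MB). I need a column that divides Shape_Area by the sum of Shape_Area within each COD (I call that sum ShapeSum). I first tried to do it in a single formula, but that failed, so I tried again in two steps: 1) sum Shape_Area by COD and, if that succeeds, 2) divide Shape_Area by ShapeSum.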

> str(tcm)
    Classes ‘data.table’ and 'data.frame':  26835293 obs. of  15 variables:
    $ OBJECTID  : int  1 2 3 4 5 6 7 8 9 10 ...
    $ LAT       : num  -15.7 -15.7 -15.7 -15.7 -15.7 ...
    $ LONG      : num  -58.1 -58.1 -58.1 -58.1 -58.1 ...
    $ UF        : chr  "MT" "MT" "MT" "MT" ...
    $ COD       : num  510562 510562 510562 510562 510562 ...
    $ AREA_97   : num  1130 1130 1130 1130 1130 ...
    $ Shape_Area: num  255266.7 14875 25182.2 5503.9 95.5 ...
    $ TYPE      : chr  "2" "2" "2" "2" ...
    $ Nomes     : chr  NA NA NA NA ...
    $ NEAR_DIST : num  376104 371332 371410 371592 371330 ...
    $ tc_2004   : chr  "AREA_URBANA" "DESFLORESTAMENTO_2004" "DESFLORESTAMENTO_2004" "DESFLORESTAMENTO_2004" ...
    $ tc_2008   : chr  "AREA_URBANA" "AREA_NAO_OBSERVADA" "AREA_NAO_OBSERVADA" "AREA_NAO_OBSERVADA" ...
    $ tc_2010   : chr  "AREA_URBANA" "PASTO_LIMPO" "PASTO_LIMPO" "PASTO_LIMPO" ...
    $ tc_2012   : chr  "AREA_URBANA" "PASTO_SUJO" "PASTO_SUJO" "PASTO_SUJO" ...
    $ tc_2014   : chr  "AREA_URBANA" "PASTO_LIMPO" "PASTO_LIMPO" "PASTO_SUJO" ...
    - attr(*, ".internal.selfref")=<externalptr> 

> tcm$ShapeSum <- tcm[, Shape_Area := sum(tcm$Shape_Area), by="COD"]
     Error: cannot allocate vector of size 204.7 Mb
     Error during wrapup: cannot allocate vector of size 542.3 Mb

我也尝试了以下代码,但都失败了:

> tcm$ShapeSum <- apply(tcm[, c(Shape_Area)], 1, function(x) sum(x), by="COD")

Error in apply(tcm[, c(Shape_Area)], 1, function(x) sum(x), by = "COD") : dim(X) must have a positive length

> tcm$ShapeSum <- mutate(tcm, ShapeSum = sum(Shape_Area), by="COD", package = "dplyr")

Error: cannot allocate vector of size 204.7 Mb
Error during wrapup: cannot allocate vector of size 542.3 Mb

> tcm$ShapeSum <- tcm[, transform(tcm, ShapeSum = sum(Shape_Area)), by="COD"]

> tcm$ShapeSum <- transform(tcm, aggregate(tcm$AreaShape, by=list(Category=tcm$COD), FUN=sum))

Error in aggregate.data.frame(as.data.frame(x), ...): no rows to aggregate

Thank you very much for your attention and for any suggestions on solving this problem.

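library(data.table)

# fread() reads a delimited text file straight into a data.table
tcm <- fread("your_tcm_file.txt")

# := adds the column by reference, without copying the table
tcm[, newColumn := oldColumnPlusOne + 1]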

More: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
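
As a minimal sketch of what `:=` does, the toy table below gets a new column added by reference. The names `dt`, `x`, and `y` are made up for illustration, and `address()` from data.table is used only to compare where the table lives before and after the update.

```r
library(data.table)

# Hypothetical toy table standing in for tcm
dt <- data.table(x = c(1, 2, 3))

before <- address(dt)   # memory address of the table before the update
dt[, y := x + 1]        # adds column y in place; no `dt <- ...` is needed
after <- address(dt)    # memory address of the table after the update

# Compare the two addresses and inspect the new column
identical(before, after)
dt$y
```

Because nothing is reassigned, there is no second full-size object competing for the 8 GB of RAM mentioned in the question.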

We can use the data.table approach to create the column. It is more efficient because assignment with := happens in place.

library(data.table)
tcm[, ShapeSum := sum(Shape_Area), by = COD]

Or, as @user20650 suggested (based on the OP's description):

tcm[, ShapeSum := Shape_Area/sum(Shape_Area), by = COD]
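
To check the grouped versions on something small before running them on the full 26.8 million rows, here is a sketch on a toy table. The values and the column name `ShapeShare` are invented for illustration; only `COD`, `Shape_Area`, and `ShapeSum` come from the question. Inside `[...]` the column is written as `Shape_Area`, not `tcm$Shape_Area`, since the latter sums the whole column instead of each group.

```r
library(data.table)

# Toy stand-in for tcm; the values are invented for illustration
toy <- data.table(
  COD        = c(1, 1, 2, 2, 2),
  Shape_Area = c(10, 30, 5, 5, 10)
)

# Step 1: total Shape_Area within each COD, added by reference
toy[, ShapeSum := sum(Shape_Area), by = COD]

# Step 2: each row's share of its COD total (ShapeShare is a hypothetical name)
toy[, ShapeShare := Shape_Area / ShapeSum]

# Sanity check: the shares should add up to 1 within every COD
toy[, .(total_share = sum(ShapeShare)), by = COD]
```

Keeping the two steps separate leaves both the group total and the share in the table. The one-line form above computes the share directly if the total itself is not needed.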