Optimize calls to mutate and summarise?

I have this R script:

rm(list = ls())

library(tidyr)
suppressWarnings(library(dplyr))
outFile = "zFinal.lua"

cat("4\n")

cat(file = outFile, sep = "")

filea <- read.csv("csva.csv", strip.white = TRUE)
fileb <- read.csv("csvb.csv", strip.white = TRUE, sep = ";", header=FALSE)

df <-    
    merge(filea, fileb, by.x = c(3), by.y = c(1)) %>%
    subset(select = c(1, 3, 6, 2)) %>%
    arrange(ColA, ColB, V2) %>%
    group_by(ColA) %>%
    mutate(V2 = paste0('"', V2, "#", ColB, '"')) %>%
    summarise(ID = paste(V2, collapse = ", ", sep=";")) %>%
    mutate(ID = paste0('["', ColA, '"] = {', ID, '},')) %>%
    mutate(ID = paste0('\t\t', ID))

df <- df[c("ID")]

cat("\n\tmyTable = {\n", file = outFile, append = TRUE, sep = "\n")
write.table(df, append = TRUE, file = outFile, sep = ",", quote = FALSE, row.names = FALSE, col.names = FALSE)
cat("\n\t}", file = outFile, append = TRUE, sep = "\n")

# Done
cat("\nDONE.", sep = "\n")

As you can see, this script opens csva.csv and csvb.csv.

Here is csva.csv:

ID,ColA,ColB,ColC,ColD
2,3,100,1,1
3,7,300,1,1
5,7,200,1,1
11,22,900,1,1
14,27,500,1,1
16,30,400,1,1
20,36,900,1,1
23,39,800,1,1
24,42,700,1,1
29,49,800,1,1
45,3,200,1,1

Here is csvb.csv:

100;file1
200;file2
300;file3
400;file4

Here is the output file that my script and the CSV files produce:

myTable = {

    ["3"] = {"file1#100", "file2#200"},
    ["7"] = {"file2#200", "file3#300"},
    ["30"] = {"file4#400"},

}

This output file is exactly what I want. It's perfect.

Here is what the script does. I'm not sure I can explain it well, so if I don't do a good job here, please skip this part.

For each row in csva.csv, if ColB (csva) contains a number that also appears in column 1 (csvb), then the output file should contain a line like this:

["3"] = {"file1#100", "file2#200"},

So, in the example above, the first row with 3 in ColA (csva) has 100 in its ColB. In csvb, column 1 contains 100 and column 2 contains file1, which gives file1#100.

Because csva contains another 3 in ColA (in the last row), that row is also processed and added to the same output line.

OK, so my script does run very well and produces perfect output. The problem is that it takes far too long to run. The csva and csvb in my question here have only a few rows, so the output is instant.

However, the data I have to use in the real world has over 300,000 rows in csva and over 900,000 rows in csvb, so the script takes a very, very long time to run (so long that it is not feasible). It does work perfectly well, but it simply takes far too long to run.

From progressively commenting out lines, the slowdown seems to be related to mutate and summarise. Without those lines, the script completes in about 30 seconds. But with mutate and summarise it takes hours.
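
For what it's worth, one way to confirm this (just a rough sketch; the intermediate object names are only for illustration) is to time the individual stages with system.time():

# Rough sketch: time each stage separately to see where the time goes
joined <- merge(filea, fileb, by.x = c(3), by.y = c(1))

system.time(
  grouped <- joined %>%
    group_by(ColA) %>%
    mutate(V2 = paste0('"', V2, "#", ColB, '"'))
)

system.time(
  collapsed <- grouped %>%
    summarise(ID = paste(V2, collapse = ", "))
)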

I don't know R very well, so how can I make my script run faster, either by improving my syntax or by using faster alternatives to mutate and summarise?

You could try loading the tables as data.tables. Operations on data.tables are usually faster than on data.frames:

library(data.table)
filea <- fread("csva.csv")

Before using the mutate function, check whether the object is still a data.table (just print it and you will see an obvious difference from a data.frame).
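
If you want to go further down the data.table road, here is a rough sketch of the whole join-and-collapse in data.table syntax (column names taken from your files; this is only a sketch, not tested on your full data):

library(data.table)

filea <- fread("csva.csv")                      # ID, ColA, ColB, ColC, ColD
fileb <- fread("csvb.csv", sep = ";", header = FALSE,
               col.names = c("ColB", "file_name"))

setkey(filea, ColB)                             # keyed join on ColB
setkey(fileb, ColB)

dt <- fileb[filea, nomatch = 0]                 # inner join; unmatched rows are dropped
dt <- dt[order(ColA, ColB),
         .(ID = paste0('"', file_name, "#", ColB, '"', collapse = ", ")),
         by = ColA]

cat("\n\tmyTable = {\n",
    paste0('\t\t["', dt$ColA, '"] = {', dt$ID, '},', collapse = "\n"),
    "\n\t}\n", sep = "")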

Here is a dplyr approach that is very similar to yours. The real difference is that rows and columns are dropped from the objects as soon as possible, so there is less baggage to move around.

I'm guessing at what will actually help with the large datasets. Please report the before and after durations. I like how you identified which calls were taking the longest; reporting the new bottlenecks would also help.

If that is not fast enough, the next easiest step is probably moving to sqldf (which uses SQLite under the covers) or data.table. Both involve learning a different syntax (unless you already know SQL), but could be worth your time in the long run.

# Pretend this info is being read from a file
str_a <-
"ID,ColA,ColB,ColC,ColD
2,3,100,1,1
3,7,300,1,1
5,7,200,1,1
11,22,900,1,1
14,27,500,1,1
16,30,400,1,1
20,36,900,1,1
23,39,800,1,1
24,42,700,1,1
29,49,800,1,1
45,3,200,1,1"

str_b <-
"100;file1
200;file2
300;file3
400;file4"


# Declare the desired columns and their data types.
#   Include only the columns needed.  Use the smaller 'integer' data type where possible.
col_types_a <- readr::cols_only(
  `ID`      = readr::col_integer(),
  `ColA`    = readr::col_integer(),
  `ColB`    = readr::col_integer(),
  `ColC`    = readr::col_integer()
  # `ColD`    = readr::col_integer() # Exclude columns never used
)
col_types_b <- readr::cols_only(
  `ColB`      = readr::col_integer(),
  `file_name` = readr::col_character()
)

# Read the file into a tibble
ds_a <- readr::read_csv(str_a, col_types = col_types_a)
ds_b <- readr::read_delim(str_b, delim = ";", col_names = c("ColB", "file_name"), col_types = col_types_b)

ds_a %>% 
  dplyr::select( # Quickly drop as many columns as possible; avoid reading if possible
    ID,
    ColB,
    ColA
  ) %>% 
  dplyr::left_join(ds_b, by = "ColB") %>% # Join the two datasets
  tidyr::drop_na(file_name) %>%           # Dump the records you'll never use
  dplyr::mutate(                          # Create the hybrid column
    entry = paste0('"', file_name, "#", ColB, '"')
  ) %>%
  dplyr::select(                          # Dump the unneeded columns
    -ID,
    -file_name
  ) %>% 
  dplyr::group_by(ColA) %>%               # Create a bunch of subdatasets
  dplyr::arrange(ColB, entry) %>%         # Sorting inside the group usually is faster?
  dplyr::summarise(
    entry = paste(entry, collapse = ", ", sep = ";")
  ) %>%
  dplyr::ungroup() %>%                    # Stack all the subsets on top of each other
  dplyr::mutate(                          # Mush the two columns
    entry = paste0('\t\t["', ColA, '"] = {', entry, '},')
  ) %>% 
  dplyr::pull(entry) %>%                  # Isolate the desired vector
  paste(collapse = "\n") %>%              # Combine all the elements into one.
  cat()

Result:

        ["3"] = {"file1#100", "file2#200"},
        ["7"] = {"file2#200", "file3#300"},
        ["30"] = {"file4#400"},

Here is a more compact version of your code in base R, which should give you some performance improvement.

(Edited to match the data provided by wibeasley.)

# Look up each ColB in csvb and attach the matching file name
ds_a$file_name <- ds_b$file_name[match(ds_a$ColB, ds_b$ColB)]
# Drop unmatched rows and the unneeded ColC column
ds_a <- ds_a[!is.na(ds_a$file_name), -4]
ds_a <- ds_a[order(ds_a$ColB),]
ds_a$file_name <- paste0('"', ds_a$file_name, "#", ds_a$ColB, '"')
# Collapse one comma-separated string per ColA, then format the Lua lines
res <- tapply(ds_a$file_name, ds_a$ColA, FUN = paste, collapse = ", ")
res <- paste0("\t\t[\"", names(res), "\"] = {", res, "},", collapse = "\n")
cat("\n\tmyTable = {", res, "\t}", sep = "\n\n")

Output:

myTable = {

    ["3"] =  {"file1#100", "file2#200"},
    ["7"] =  {"file2#200", "file3#300"},
    ["30"] =  {"file4#400"},

}
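
To see how much this actually buys you on the real data, you could wrap each version in a function and compare them; run_dplyr() and run_base() below are just placeholder names for your two versions:

library(microbenchmark)

# run_dplyr() = your original pipeline, run_base() = the base R version above
microbenchmark(
  dplyr = run_dplyr(),
  base  = run_base(),
  times = 5
)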

Here is another solution that leverages data.table's performance while still staying within the dplyr world you're comfortable with. I'm not sure there is much room for improvement within only 10 seconds, but in theory this could help larger datasets, where the cost of creating the index is amortized over a longer execution.

The dtplyr package translates the dplyr verbs (which you are familiar with) into data.table syntax. It takes advantage of keys, which should improve performance, especially for joining and grouping.

The dtplyr::lazy_dt function may help optimize the dplyr-to-data.table translation.

Finally, vroom replaces readr, mostly out of curiosity. It is independent of the other changes, and it sounds like reading the files was never the bottleneck.

col_types_a <- vroom::cols_only(
  `ID`      = vroom::col_integer(),
  `ColA`    = vroom::col_integer(),
  `ColB`    = vroom::col_integer(),
  `ColC`    = vroom::col_integer()
  # `ColD`    = vroom::col_integer() # Leave out this column b/c it's never used
)
col_types_b <- vroom::cols_only(
  `ColB`      = vroom::col_integer(),
  `file_name` = vroom::col_character()
)
ds_a <- vroom::vroom(str_a, col_types = col_types_a)
ds_b <- vroom::vroom(str_b, delim = ";", col_names = c("ColB", "file_name"), col_types = col_types_b)

# ds_a <- data.table::setDT(ds_a, key = c("ColB", "ColA"))
# ds_b <- data.table::setDT(ds_b, key = "ColB")

ds_a <- dtplyr::lazy_dt(ds_a, key_by = c("ColB", "ColA"))    # New line 1
ds_b <- dtplyr::lazy_dt(ds_b, key_by = "ColB")               # New line 2

ds_a %>% 
  dplyr::select( # Quickly drop as many columns as possible; avoid reading if possible
    ID,
    ColB,
    ColA
  ) %>%
  dplyr::inner_join(ds_b, by = "ColB") %>%                   # New line 3 (replaces left join)
  # tidyr::drop_na(file_name) %>%                            # Remove this line
  # dplyr::filter(!is.na(file_name)) %>%                     # Alternative w/ left join
  dplyr::mutate(
    entry = paste0('"', file_name, "#", ColB, '"')
  ) %>%
  dplyr::select( # Dump the unneeded columns
    -ID,
    -file_name
  ) %>% 
  dplyr::group_by(ColA) %>%
  dplyr::arrange(ColB, entry) %>%  # Sort inside the group usually helps
  dplyr::summarise(
    entry = paste(entry, collapse = ", ", sep=";")
  ) %>%
  dplyr::ungroup() %>% 
  dplyr::mutate(
    entry = paste0('\t\t["', ColA, '"] = {', entry, '},')
  ) %>% 
  dplyr::pull(entry) %>% # Isolate the desired vector
  paste(collapse = "\n") %>% 
  cat()
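
If you are curious what data.table code dtplyr actually generates (and whether the keys are being used), you can stop the pipeline before pull()/cat() and call dplyr::show_query() on it; a simplified sketch:

# Simplified sketch: inspect the data.table translation of a lazy_dt pipeline
ds_a %>%
  dplyr::select(ID, ColB, ColA) %>%
  dplyr::inner_join(ds_b, by = "ColB") %>%
  dplyr::group_by(ColA) %>%
  dplyr::summarise(n = dplyr::n()) %>%
  dplyr::show_query()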