Need to split a long (over 1,000,000 rows) CSV file by column value and rename the output using values from other columns

Question

I have a folder of CSV files in the following format:

file-2017-08-14.csv

Ticker  Price   Date
AAPL    1   2017-08-14
AAPL    2   2017-08-14
AAPL    3   2017-08-14
AAPL    4   2017-08-14
MSFT    5   2017-08-14
MSFT    6   2017-08-14
MSFT    7   2017-08-14
GOOG    8   2017-08-14
GOOG    9   2017-08-14
...

file-2017-08-13.csv

Ticker  Price   Date
AAPL    1   2017-08-13
AAPL    2   2017-08-13
AAPL    3   2017-08-13
AAPL    4   2017-08-13
MSFT    5   2017-08-13
MSFT    6   2017-08-13
MSFT    7   2017-08-13
GOOG    8   2017-08-13
GOOG    9   2017-08-13
...

And so on. I need to split these into 2 × 3 = 6 sub-files, named accordingly:

/out/AAPL-2017-08-14.csv

Ticker  Price   Date
AAPL    1   2017-08-14
AAPL    2   2017-08-14
AAPL    3   2017-08-14
AAPL    4   2017-08-14

/out/MSFT-2017-08-14.csv

Ticker  Price   Date
MSFT    5   2017-08-14
MSFT    6   2017-08-14
MSFT    7   2017-08-14

/out/GOOG-2017-08-14.csv

Ticker  Price   Date
GOOG    8   2017-08-14
GOOG    9   2017-08-14

/out/AAPL-2017-08-13.csv

Ticker  Price   Date
AAPL    1   2017-08-13
AAPL    2   2017-08-13
AAPL    3   2017-08-13
AAPL    4   2017-08-13

/out/MSFT-2017-08-13.csv

Ticker  Price   Date
MSFT    5   2017-08-13
MSFT    6   2017-08-13
MSFT    7   2017-08-13

/out/GOOG-2017-08-13.csv

Ticker  Price   Date
GOOG    8   2017-08-13
GOOG    9   2017-08-13

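I wrote a script that groups by ticker and splits a single file, but I can't figure out how to do the renaming properly, or how to loop over all the files in the input folder.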

Import-Csv file-2017-08-14.csv | Group-Object -Property "Ticker" | Foreach-Object {
    $path = $_.Name + ".csv";
    $_.Group | Export-Csv -Path $path -NoTypeInformation
}

Any ideas?

Answer 1

Method 1

Get-ChildItem -Filter '*.csv' -File -Force `
    | Select-Object -ExpandProperty 'FullName' `
    | Import-Csv -Delimiter "`t" `
    | ForEach-Object -Process {
        $outputFilePath = "out\{0}-{1}.csv" -f $_.Ticker, $_.Date;

        $_ | Export-Csv -Path $outputFilePath -Append -NoTypeInformation;
    };

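The lines above do the following:

Get-ChildItem retrieves the .csv files from the current directory (not including child directories).
The results of Get-ChildItem are FileInfo instances, but we want to pass string instances representing file paths to Import-Csv, so we use Select-Object to pass only the FullName property down the pipeline.
Import-Csv reads the CSV file(s) specified in the pipeline and passes each record down the pipeline.
Inside ForEach-Object, the $_ variable holds each CSV record. We build the output path for that record from its Ticker and Date properties (the latter is a string, not a DateTime, so no formatting is necessary). We then pass the record to Export-Csv, having it append a new row to the file at $outputFilePath.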
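
To see the path-building step in isolation, here is a minimal sketch that parses two tab-delimited rows with ConvertFrom-Csv (standing in for Import-Csv so that no input file is needed) and formats an output path for each. The sample rows and the variable names $sampleLines, $records, $dateTypeName and $outputFilePaths are made up for illustration.

```powershell
# Hypothetical sample data; ConvertFrom-Csv stands in for Import-Csv here
$sampleLines = @(
    "Ticker`tPrice`tDate",
    "AAPL`t1`t2017-08-14",
    "MSFT`t5`t2017-08-13"
);
$records = $sampleLines | ConvertFrom-Csv -Delimiter "`t";

# Every property of a CSV record is a string, including Date,
# so the -f operator inserts it into the path exactly as it was read
$dateTypeName = $records[0].Date.GetType().FullName;

# Build one output path per record, the same way Method 1 does
$outputFilePaths = $records | ForEach-Object -Process {
    "out\{0}-{1}.csv" -f $_.Ticker, $_.Date;
};

$dateTypeName;
$outputFilePaths;
```

Note that Export-Csv does not create missing directories, so the out directory has to exist before the script runs (for example via New-Item -ItemType Directory -Path 'out' -Force).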

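While this code is short and simple, opening and appending to an output file once per input record is very slow, especially for a million rows. Memory usage is minimal, though, because only one record is in memory at any given time.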

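Method 2

We can improve on that code by appending to each output file only after every 1,000 records (or whatever value you like) instead of after every record. A Hashtable stores the pending records for each output file, and the pending records are flushed when a given output file reaches the pending record limit or when there are no more records to read (end of the input files):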

$pendingRecordsByFilePath = @{};
$maxPendingRecordsPerFilePath = 1000;

Get-ChildItem -Filter '*.csv' -File -Force `
    | Select-Object -ExpandProperty 'FullName' `
    | Import-Csv -Delimiter "`t" `
    | ForEach-Object -Process {
        $outputFilePath = "out\{0}-{1}.csv" -f $_.Ticker, $_.Date;
        $pendingRecords = $pendingRecordsByFilePath[$outputFilePath];

        if ($null -eq $pendingRecords)
        {
            # This is the first time we're encountering this output file; create a new array
            $pendingRecords = @();
        }
        elseif ($pendingRecords.Length -ge $maxPendingRecordsPerFilePath)
        {
            # Flush all pending records for this output file
            $pendingRecords `
                | Export-Csv -Path $outputFilePath -Append -NoTypeInformation;
            $pendingRecords = @();
        }

        $pendingRecords += $_;
        $pendingRecordsByFilePath[$outputFilePath] = $pendingRecords;
    };

# No more input records to be read; flush all pending records for each output file
foreach ($outputFilePath in $pendingRecordsByFilePath.Keys)
{
    $pendingRecordsByFilePath[$outputFilePath] `
        | Export-Csv -Path $outputFilePath -Append -NoTypeInformation;
}

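Method 3

We can improve on this further by using a List<object> instead of an array to store the pending records. Setting the list's capacity to $maxPendingRecordsPerFilePath at creation eliminates the overhead of growing those arrays every time another record is added.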

$pendingRecordsByFilePath = @{};
$maxPendingRecordsPerFilePath = 1000;

Get-ChildItem -Filter '*.csv' -File -Force `
    | Select-Object -ExpandProperty 'FullName' `
    | Import-Csv -Delimiter "`t" `
    | ForEach-Object -Process {
        $outputFilePath = "out\{0}-{1}.csv" -f $_.Ticker, $_.Date;
        $pendingRecords = $pendingRecordsByFilePath[$outputFilePath];

        if ($null -eq $pendingRecords)
        {
            # This is the first time we're encountering this output file; create a new list
            $pendingRecords = New-Object `
                -TypeName 'System.Collections.Generic.List[Object]' `
                -ArgumentList (,$maxPendingRecordsPerFilePath);
            $pendingRecordsByFilePath[$outputFilePath] = $pendingRecords;
        }
        elseif ($pendingRecords.Count -ge $maxPendingRecordsPerFilePath)
        {
            # Flush all pending records for this output file
            $pendingRecords `
                | Export-Csv -Path $outputFilePath -Append -NoTypeInformation;
            $pendingRecords.Clear();
        }
        $pendingRecords.Add($_);
    };

# No more input records to be read; flush all pending records for each output file
foreach ($outputFilePath in $pendingRecordsByFilePath.Keys)
{
    $pendingRecordsByFilePath[$outputFilePath] `
        | Export-Csv -Path $outputFilePath -Append -NoTypeInformation;
}
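
The difference between Methods 2 and 3 can be seen in isolation. The following is a small sketch, not part of the benchmarked code, that shows += replacing an array with a new, larger one while List.Add keeps working on the same pre-sized object. The variable names are illustrative.

```powershell
# Arrays are fixed-size: += allocates a new array and copies every existing element
$array = @(1, 2, 3);
$arrayBefore = $array;
$array += 4;
$arrayWasReplaced = -not [Object]::ReferenceEquals($arrayBefore, $array);

# A List created with an initial capacity adds in place until that capacity is exceeded
$list = New-Object `
    -TypeName 'System.Collections.Generic.List[Object]' `
    -ArgumentList (,1000);
$listBefore = $list;
$list.Add(4);
$listWasReplaced = -not [Object]::ReferenceEquals($listBefore, $list);

"Array replaced by +=: $arrayWasReplaced";
"List replaced by Add: $listWasReplaced (Count = $($list.Count), Capacity = $($list.Capacity))";
```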

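Method 4a

We can eliminate the need to buffer output records/lines, and the constant opening of and appending to output files, by using the StreamWriter class. We create one StreamWriter per output file and leave them all open until we're finished. The try/finally block is necessary to ensure they get closed properly. I use ConvertTo-Csv to generate the output, which always includes a header line whether we need it or not, so there is logic to ensure we write the header only when a file is first opened.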

$truncateExistingOutputFiles = $true;
$outputFileWritersByPath = @{};

try
{
    Get-ChildItem -Filter '*.csv' -File -Force `
        | Select-Object -ExpandProperty 'FullName' `
        | Import-Csv -Delimiter "`t" `
        | ForEach-Object -Process {
            $outputFilePath = Join-Path -Path (Get-Location) -ChildPath ('out\{0}-{1}.csv' -f $_.Ticker, $_.Date);
            $outputFileWriter = $outputFileWritersByPath[$outputFilePath];
            $outputLines = $_ | ConvertTo-Csv -NoTypeInformation;

            if ($null -eq $outputFileWriter)
            {
                # This is the first time we're encountering this output file; create a new StreamWriter
                $outputFileWriter = New-Object `
                    -TypeName 'System.IO.StreamWriter' `
                    -ArgumentList ($outputFilePath, -not $truncateExistingOutputFiles, [System.Text.Encoding]::ASCII);

                $outputFileWritersByPath[$outputFilePath] = $outputFileWriter;

                # Write the header line
                $outputFileWriter.WriteLine($outputLines[0]);
            }

            # Write the data line
            $outputFileWriter.WriteLine($outputLines[1]);
        };
}
finally
{
    foreach ($writer in $outputFileWritersByPath.Values)
    {
        $writer.Close();
    }
}

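Surprisingly, this caused a performance change of 175%...slower than Method 3. I'll address why as I revise this code further.

Method 4b

My first thought for addressing the performance drop was to reintroduce output buffering; basically, combining Methods 3 and 4a. Equally surprisingly, this only hurt performance further. My only guess is that because StreamWriter does its own character buffering, our own buffering is unnecessary. In fact, I tested values of $maxPendingRecordsPerFilePath at powers of 10 from 10 to 100,000, and the overall performance difference between those two extremes was a mere 5 seconds. So our own buffering isn't really helping anything, and the tiny overhead of managing the List adds up to an additional 30 seconds of run time over a million iterations. So, let's scrap the buffering.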
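
That StreamWriter buffers on its own can be observed directly. The sketch below is only an illustration (it says nothing about the timings above): it writes one short line to a throwaway file in the temp directory and compares the file's size before and after the writer is closed. The file name and variable names are hypothetical.

```powershell
# Throwaway file in the temp directory (hypothetical name)
$demoFilePath = [System.IO.Path]::Combine(
    [System.IO.Path]::GetTempPath(),
    ('streamwriter-demo-{0}.csv' -f [Guid]::NewGuid())
);
$writer = New-Object `
    -TypeName 'System.IO.StreamWriter' `
    -ArgumentList ($demoFilePath, $false, [System.Text.Encoding]::ASCII);

try
{
    $writer.WriteLine('"AAPL","1","2017-08-14"');

    # The short line is still held in the StreamWriter's internal buffer
    $lengthBeforeClose = (New-Object -TypeName 'System.IO.FileInfo' -ArgumentList $demoFilePath).Length;
}
finally
{
    # Closing the writer flushes the buffered characters to the file
    $writer.Close();
}

$lengthAfterClose = (New-Object -TypeName 'System.IO.FileInfo' -ArgumentList $demoFilePath).Length;
Remove-Item -LiteralPath $demoFilePath;

"Bytes on disk before Close: $lengthBeforeClose";
"Bytes on disk after Close: $lengthAfterClose";
```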

Method 4c

Instead of using ConvertTo-Csv to output a 2-element array of strings (a header line and a data line), let's build those two lines ourselves using string formatting.
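
As a quick check of that idea, the sketch below builds both lines by hand for one made-up record and compares them with what ConvertTo-Csv produces. It assumes the default behavior of quoting every field, and it uses the -f operator where the final code uses string interpolation; the two produce the same text.

```powershell
# Hypothetical record shaped like one row produced by Import-Csv
$record = [PSCustomObject] @{ Ticker = 'AAPL'; Price = '1'; Date = '2017-08-14' };

# ConvertTo-Csv returns two strings per record: the header line, then the data line
$convertedLines = @($record | ConvertTo-Csv -NoTypeInformation);

# Building the same two lines by hand avoids calling a cmdlet for every record
$headerLine = '"Ticker","Price","Date"';
$dataLine = '"{0}","{1}","{2}"' -f $record.Ticker, $record.Price, $record.Date;

$headerMatches = $headerLine -eq $convertedLines[0];
$dataMatches = $dataLine -eq $convertedLines[1];

"Header matches ConvertTo-Csv: $headerMatches";
"Data line matches ConvertTo-Csv: $dataMatches";
```

The hand-built line does no escaping, so a value containing a double quote would come out malformed. That is not a concern for tickers, prices and dates, but it would be for free-form text.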

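Method 4d

On every iteration of ForEach-Object we need to build the output file path, because it's based on the input object's Ticker and Date properties. We pass an absolute path when constructing the StreamWriter because PowerShell has a different notion of the "current directory" (on which relative paths would be based) than a typical .NET application. We have been calling Get-Location to build that absolute path on every iteration, which isn't necessary because that path doesn't change. So, let's move the call to Get-Location outside of ForEach-Object.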
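
The two notions of "current directory" can be compared with a short sketch. It creates a throwaway directory under the temp path (a made-up location for illustration), moves into it with Set-Location, and reads both values; it also resolves the output directory once, the way this method does before the loop.

```powershell
$originalLocation = Get-Location;
$demoDirectoryPath = [System.IO.Path]::Combine(
    [System.IO.Path]::GetTempPath(),
    [Guid]::NewGuid().ToString()
);
New-Item -ItemType Directory -Path $demoDirectoryPath | Out-Null;

try
{
    Set-Location -LiteralPath $demoDirectoryPath;

    # PowerShell's own location follows Set-Location...
    $powerShellDirectory = (Get-Location).Path;
    # ...but the process-wide directory that .NET resolves relative paths against does not
    $dotNetDirectory = [System.IO.Directory]::GetCurrentDirectory();

    # Resolve the output directory once, before the loop, rather than once per record
    $outputDirectoryPath = Join-Path -Path (Get-Location) -ChildPath 'out';
}
finally
{
    Set-Location -LiteralPath $originalLocation.Path;
    Remove-Item -LiteralPath $demoDirectoryPath;
}

"PowerShell location: $powerShellDirectory";
".NET current directory: $dotNetDirectory";
"Output directory: $outputDirectoryPath";
```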

Method 4e

Instead of using Join-Path to build our output file path, let's try .NET's Path.Combine method.

Method 4f

Instead of using Join-Path to build our output file path, let's try the less platform-agnostic string interpolation ($outputFilePath = "$outputDirectoryPath\$outputFileName";).
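
For comparison, here are the three ways of building the path from Methods 4d, 4e and 4f side by side, as a minimal sketch using a made-up ticker and date. The interpolated variant is written with a hard-coded backslash, which is an assumption about what makes it less platform-agnostic; on Windows all three should yield the same string.

```powershell
$outputDirectoryPath = Join-Path -Path (Get-Location) -ChildPath 'out';
$outputFileName = '{0}-{1}.csv' -f 'AAPL', '2017-08-14';

# 4d: cmdlet call
$viaJoinPath = Join-Path -Path $outputDirectoryPath -ChildPath $outputFileName;

# 4e: static .NET method call
$viaPathCombine = [System.IO.Path]::Combine($outputDirectoryPath, $outputFileName);

# 4f: string interpolation with a hard-coded Windows separator
$viaInterpolation = "$outputDirectoryPath\$outputFileName";

$viaJoinPath;
$viaPathCombine;
$viaInterpolation;
```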

Combining the changes from Methods 4a, 4c, 4d, and 4e gives us this final code:

$truncateExistingOutputFiles = $true;
$outputDirectoryPath = Join-Path -Path (Get-Location) -ChildPath 'out';
$outputFileWritersByPath = @{};

try
{
    Get-ChildItem -Filter '*.csv' -File -Force `
        | Select-Object -ExpandProperty 'FullName' `
        | Import-Csv -Delimiter "`t" `
        | ForEach-Object -Process {
            $outputFileName = '{0}-{1}.csv' -f $_.Ticker, $_.Date;
            $outputFilePath = [System.IO.Path]::Combine($outputDirectoryPath, $outputFileName);
            $outputFileWriter = $outputFileWritersByPath[$outputFilePath];

            if ($null -eq $outputFileWriter)
            {
                # This is the first time we're encountering this output file; create a new StreamWriter
                $outputFileWriter = New-Object `
                        -TypeName 'System.IO.StreamWriter' `
                        -ArgumentList ($outputFilePath, -not $truncateExistingOutputFiles, [System.Text.Encoding]::ASCII);

                $outputFileWritersByPath[$outputFilePath] = $outputFileWriter;

                # Write the header line
                $outputFileWriter.WriteLine('"Ticker","Price","Date"');
            }

            # Write the data line
            $outputFileWriter.WriteLine("""$($_.Ticker)"",""$($_.Price)"",""$($_.Date)""");
        };
}
finally
{
    foreach ($writer in $outputFileWritersByPath.Values)
    {
        $writer.Close();
    }
}

Here are my benchmarks for each method, in seconds, averaged over three runs, each against a million-row CSV. They were run on a Core i7 860 @ 2.8 GHz with TurboBoost disabled, using 64-bit PowerShell v5.1 on Windows 10 Pro v1703:

+--------+----------------------+----------------------+--------------+---------------------+-----------------+
| Method |     Path handling    |     Line building    | File writing |   Output buffering  |  Execution time |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|    1   |       Relative       |      Export-Csv      |  Export-Csv  |          No         | 2,178.5 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|    2   |       Relative       |      Export-Csv      |  Export-Csv  | 1,000-element array |   222.9 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|    3   |       Relative       |      Export-Csv      |  Export-Csv  |  1,000-element List |   154.2 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|   4a   |       Join-Path      |     ConvertTo-Csv    | StreamWriter |          No         |   425.0 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|   4b   |       Join-Path      |     ConvertTo-Csv    | StreamWriter |  1,000-element List |   456.1 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|   4c   |       Join-Path      | String interpolation | StreamWriter |          No         |   302.5 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|   4d   |       Join-Path      | String interpolation | StreamWriter |          No         |   225.1 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|   4e   | [IO.Path]::Combine() | String interpolation | StreamWriter |          No         |    78.0 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|   4f   | String interpolation | String interpolation | StreamWriter |          No         |    77.7 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
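
The answer does not say how the timings were collected. One way to take such measurements, offered only as an assumption, is Measure-Command averaged over a few runs. In the sketch below, $methodUnderTest is a hypothetical stand-in whose body would be replaced by the method being timed.

```powershell
# Hypothetical stand-in for one of the methods above; replace the body with the code to time
$methodUnderTest = {
    1..1000 | ForEach-Object -Process { '{0}-{1}.csv' -f 'AAPL', $_ } | Out-Null;
};
$runCount = 3;

# Time each run separately and collect the elapsed seconds
$elapsedSeconds = foreach ($run in 1..$runCount)
{
    (Measure-Command -Expression $methodUnderTest).TotalSeconds;
}

$averageSeconds = ($elapsedSeconds | Measure-Object -Average).Average;
'{0:N1} seconds (average of {1} runs)' -f $averageSeconds, $runCount;
```

Because Methods 1 through 3 append to their output files, the out directory would need to be emptied between runs so that every run does the same work.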

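Takeaways:

When used with Export-Csv, output buffering (1 → 2 and 1 → 3) provides a significant performance improvement.
When used with StreamWriters, output buffering (4a → 4b) doesn't help and actually comes with a small performance hit.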
Eliminating ConvertTo-Csv (4a → 4c) reduced execution time by more than a quarter (122.5 seconds).
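Method 4a was so much slower than the buffered Export-Csv methods because it introduced the use of Get-Location and Join-Path. Either these cmdlets involve much more processing behind the scenes than meets the eye, or invoking cmdlets is simply slow in general (when done a million times, of course).
- Moving Get-Location outside of ForEach-Object (4c → 4d) reduced execution time by a quarter (77.4 seconds).
- Using [System.IO.Path]::Combine() instead of Join-Path (4d → 4e) reduced execution time by two-thirds (147.1 seconds).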
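
The cost of calling a cmdlet in a tight loop can be probed with a micro-benchmark like the sketch below. It is not the benchmark used for the table: the iteration count is deliberately small, the file names are made up, and the absolute numbers will vary by machine and PowerShell version.

```powershell
$iterationCount = 10000;  # Far fewer than a million, to keep the sketch quick
$directoryPath = (Get-Location).Path;

# Time building the path with the Join-Path cmdlet
$joinPathTime = Measure-Command -Expression {
    foreach ($i in 1..$iterationCount)
    {
        $null = Join-Path -Path $directoryPath -ChildPath "file-$i.csv";
    }
};

# Time building the same path with the static .NET method
$pathCombineTime = Measure-Command -Expression {
    foreach ($i in 1..$iterationCount)
    {
        $null = [System.IO.Path]::Combine($directoryPath, "file-$i.csv");
    }
};

'Join-Path:      {0:N0} ms' -f $joinPathTime.TotalMilliseconds;
'Path.Combine(): {0:N0} ms' -f $pathCombineTime.TotalMilliseconds;
```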
Script optimization is fun and educational!