如何提高大型 txt 处理脚本的速度？

Question

我有一个程序可以扫描一个非常大的 txt 文件 (.pts file actually)，如下所示：

437288479
-6.9465 -20.49 -1.3345 70
-6.6835 -20.82 -1.3335 83
-7.3105 -20.179 -1.3325 77
-7.1005 -20.846 -1.3295 96
-7.3645 -20.759 -1.2585 79
...

第一行是文件中包含的点数，每隔一行对应 3D space 中的一个 {x,y,z,intensity} 点。上面的文件是 ~11 GB，但我还有更多文件要处理，最多可达 ~50 GB。

这是我用来读取此文件的代码：

#include <iostream>
#include <chrono>
#include <vector>
#include <algorithm>
#include <tuple>
#include <cmath>

// boost library
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/iostreams/stream.hpp>

struct point
{
    double x;
    double y;
    double z;
};


void readMappedFile()
{
    boost::iostreams::mapped_file_source mmap("my_big_file.pts");
    boost::iostreams::stream<boost::iostreams::mapped_file_source> is(mmap, std::ios::binary);
    std::string line;

    // get rid of the first line
    std::getline(is, line);
    
    while (std::getline(is, line))
    {
        point p;
        sscanf(line.c_str(),"%lf %lf %lf %*d", &(p.x), &(p.y), &(p.z));
        if (p.z > minThreshold && p.z < maxThreshold)
        {
            // do something with p and store it in the vector of tuples
            // O(n) complexity
        }
    }
}

int main ()
{
    readMappedFile();
    return 0;
}

对于我的 11 GB 文件，扫描所有行并将数据存储在 point p 中需要 ~13 minutes 来执行。有没有办法让它更快？因为每次我扫描一个点，我也必须用它做一些事情。这将使我的程序最终需要几个小时才能执行。

我开始考虑使用多个内核，但如果某些点出于某种原因链接在一起，似乎会出现问题。如果您对如何进行有任何建议，我很乐意听取。

Edit1 ：我在运行笔记本电脑上运行程序，CPU 包含 8 cores - 2.9GHz，ram 是 16GB，我正在使用 ssd。为此，该程序必须运行在类似的硬件上。

Edit2：这是完整的程序，因此您可以告诉我哪里做错了。我将每个点定位在一种名为 slab 的二维网格中。每个单元格将包含一定数量的点和一个 z 平均值。

#include <iostream>
#include <chrono>
#include <vector>
#include <algorithm>
#include <tuple>
#include <cmath>

// boost library
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/iostreams/stream.hpp>

struct point
{
    double x;
    double y;
    double z;
};

/*
    Compute Slab
*/

float slabBox[6] = {-25.,25.,-25.,25.,-1.,0.};
float dx = 0.1;
float dy = 0.1;
int slabSizeX = (slabBox[1] - slabBox[0]) / dx;
int slabSizeY = (slabBox[3] - slabBox[2]) / dy;

std::vector<std::tuple<double, double, double, int>> initSlab() 
{
    // initialize the slab vector according to the grid size
    std::vector<std::tuple<double, double, double, int>> slabVector(slabSizeX * slabSizeY, {0., 0., 0., 0});

    // fill the vector with {x,y} cells coordinates
    for (int y = 0; y < slabSizeY; y++)
    {
        for (int x = 0; x < slabSizeX; x++)
        {
            slabVector[x + y * slabSizeX] = {x * dx + slabBox[0], y * dy + slabBox[2], 0., 0};
        }
    }
    return slabVector;
}

std::vector<std::tuple<double, double, double, int>> addPoint2Slab(point p, std::vector<std::tuple<double, double, double, int>> slabVector)
{
    // find the region {x,y} in the slab in which coord {p.x,p.y} is
    int x = (int) floor((p.x - slabBox[0])/dx);
    int y = (int) floor((p.y - slabBox[2])/dy);
    
    // calculate the new z value
    double z = (std::get<2>(slabVector[x + y * slabSizeX]) * std::get<3>(slabVector[x + y * slabSizeX]) + p.z) / (std::get<3>(slabVector[x + y * slabSizeX]) + 1);

    // replace the older z
    std::get<2>(slabVector[x + y * slabSizeX]) = z;

    // add + 1 point in the cell
    std::get<3>(slabVector[x + y * slabSizeX])++;
    return slabVector;
}

/*
    Parse the file 
*/

void readMappedFile()
{
    boost::iostreams::mapped_file_source mmap("my_big_file.pts");
    boost::iostreams::stream<boost::iostreams::mapped_file_source> is(mmap, std::ios::binary);

    std::string line;
    std::getline(is, line);

    auto slabVector = initSlab();
    
    while (std::getline(is, line))
    {
        point p;
        sscanf(line.c_str(),"%lf %lf %lf %*d", &(p.x), &(p.y), &(p.z));
        if (p.z > slabBox[4] && p.z < slabBox[5])
        {
            slabVector = addPoint2Slab(p, slabVector);
        }
    }
}

int main ()
{
    readMappedFile();
    return 0;
}

Answer 1

如果您使用 HDD 来存储您的文件，那么以 100Mb/s 的速度读取将花费大约 2 分钟，这是一个很好的案例。尝试读取文件的一个块并在另一个线程中处理它，而下一个块将被读取。

此外，您还有类似的内容：

std::vector<...> addPoint2Slab(point, std::vector<...> result)
{
    ...
    return result;
}

slabVector = addPoint2Slab(point, slabVector);

我想它会在每次调用时带来不必要的 slabVector 复制（实际上，编译器可能会优化它）。如果您按以下方式传递矢量，请尝试检查速度：

std::vector<...> addPoint2Slab(point, std::vector<...> & result);

然后调用：

addPoint2Slab(point, slabVector);

如果它会获得速度奖励，您可以检查如何在没有开销的情况下转发结果。

Answer 2

干掉std::getline。 iostreams 与字符串的直接“内存中”处理相比相当慢。也不要使用 sscanf。

分配大块内存，即 128MB 或更多。在一次调用中从文件中读取所有内容。然后解析这个块，直到你到达终点。

有点像这样：

std::vector<char> huge_chunk(128*1024*1024);
ifstream in("my_file");
do {
   in.read(huge_chunk.data(), huge_chunk.size());
   parse(huge_chunk.data, in.gcount());
} while (in.good());

你明白了。

用strtof、find等解析chunk

解析块会在块的末尾留下几个字符，这些字符不构成完整的一行。您需要暂时存储它们并从那里继续解析下一个块。

一般来说：调用ifstream越少越好。并且使用“较低的 API”函数，例如 strtof、strtoul 等...通常比 sscanf、format 等更快...

这对于 <1MB 的小文件通常无关紧要，但对于非常大的文件可能会产生巨大的影响。

此外：使用探查器准确找出您的程序正在等待的位置。英特尔的 VTune 分析器是免费的，afaik。它是 OneAPI 工具包的一部分，是我所知道的最好的工具之一。

Answer 3

使用内存映射很好。使用 IOStreams 不是。这是使用 Boost Spirit 进行解析的完整示例：

入门简单

我建议对类型名称进行一些清理

using Record = std::tuple<double, double, double, int>;

std::vector<Record> initSlab()
{
    // initialize the slab vector according to the grid size
    std::vector<Record> slabVector(slabSizeX * slabSizeY, {0., 0., 0., 0});

    // fill the vector with {x,y} cells coordinates
    for (int y = 0; y < slabSizeY; y++) {
        for (int x = 0; x < slabSizeX; x++) {
            slabVector[x + y * slabSizeX] = {
                x * dx + slabBox[0],
                y * dy + slabBox[2],
                0.,
                0,
            };
        }
    }
    return slabVector;
}

You could just use a struct instead of the tuple, but that's an exercise for the reader

不要一直复制 SlabVector

您 addPoint2Slab 按值获取 slabVector（复制）并返回修改后的向量。即使针对几个动作进行了优化，它仍然至少在每次调用 addPoint2Slab 时分配一个临时副本。相反，让它成为预期的变异函数：

void addPoint2Slab(point const p, std::vector<Record>& slabVector)
{
    // find the region {x,y} in the slab in which coord {p.x,p.y} is
    int x = (int) floor((p.x - slabBox[0])/dx);
    int y = (int) floor((p.y - slabBox[2])/dy);
    auto& [ix, iy, iz, icount] = slabVector[x + y * slabSizeX];

    iz = (iz * icount + p.z) / (icount + 1);
    icount += 1;
}

Note also that the tuple handling has been greatly simplified with structured bindings. You can even see what the code is doing, which was nearly impossible before - let alone verify.

读取映射文件

auto readMappedFile(std::string fname)
{
    auto slabVector = initSlab();

    boost::iostreams::mapped_file_source mmap(fname);

    auto handle = [&](auto& ctx) {
        using boost::fusion::at_c;
        point p{at_c<0>(_attr(ctx)), at_c<1>(_attr(ctx)), at_c<2>(_attr(ctx))};
        //auto intensity = at_c<3>(_attr(ctx));

        if (p.z > slabBox[4] && p.z < slabBox[5])
            addPoint2Slab(p, slabVector);
    };

    namespace x3 = boost::spirit::x3;
    static auto const line_ =
        x3::float_ >> x3::float_ >> x3::float_ >> x3::int_;

    auto first = mmap.data(), last = first + mmap.size();
    try {
        bool ok = x3::phrase_parse( //
            first, last,
            x3::expect[x3::uint_ >> x3::eol] //
                >> line_[handle] % x3::eol   //
                // expect EOF here
                >> *x3::eol >> x3::expect[x3::eoi], //
            x3::blank);

        // ok is true due to the expectation points
        assert(ok);
    } catch (x3::expectation_failure<char const*> const& ef) {
        auto where = ef.where();
        auto till  = std::min(last, where + 32);
        throw std::runtime_error("Expected " + ef.which() + " at #" +
                                 std::to_string(where - mmap.data()) + " '" +
                                 std::string(where, till) + "'...");
    }

    return slabVector;
}

在这里，我们使用 Boost Spirit X3 生成一个解析器，该解析器读取行并在每一行上调用 handle，就像您之前所做的那样。添加了少量错误处理。

让我们测试一下

这是我使用的测试驱动程序

#include <fmt/ranges.h>
#include <fstream>
#include <random>
#include <ranges>
using std::ranges::views::filter;

int main()
{
    std::string const fname = "T032_OSE.pts";
#if 0 || defined(GENERATE)
    using namespace std;
    // generates a ~12Gib file
    ofstream ofs(fname);
    mt19937  prng{random_device{}()};
    uniform_real_distribution<float> x(-25, 25), y(-25, +25), z(-1, 0);
    uniform_int_distribution<>       n(0, 100);
    auto N = 437288479;
    ofs << N << "\n";
    while (N--)
        ofs << x(prng) << " " << y(prng) << " " << z(prng) << " " << n(prng) << "\n";
#else
    auto sv        = readMappedFile(fname);
    auto has_count = [](Record const& tup) { return get<3>(tup) > 0; };
    fmt::print("slabVector:\n{}\n", fmt::join(sv | filter(has_count), "\n"));
#endif
}

注意如何使用条件编译代码生成输入文件（因为我没有你的大文件）。

在这个 ~13GiB file (compressed copy online) 上，它在我的机器上以 1 分 14 秒的速度运行：

slabVector:
(-25, -25, -0.49556059843940164, 1807)
(-24.899999618530273, -25, -0.48971092838941654, 1682)
(-24.799999237060547, -25, -0.49731256076256386, 1731)
(-24.700000762939453, -25, -0.5006042266973916, 1725)
(-24.600000381469727, -25, -0.5000671732885645, 1784)
(-24.5, -25, -0.4940826157717386, 1748)
(-24.399999618530273, -25, -0.5045350563593015, 1720)
(-24.299999237060547, -25, -0.5088279537549671, 1812)
(-24.200000762939453, -25, -0.5065565364794715, 1749)
(-24.100000381469727, -25, -0.4933392542558793, 1743)
(-24, -25, -0.4947248105973453, 1808)
(-23.899999618530273, -25, -0.48640208470636714, 1696)
(-23.799999237060547, -25, -0.4994672590531847, 1711)
(-23.700000762939453, -25, -0.5033631130808075, 1782)
(-23.600000381469727, -25, -0.4995593140170436, 1760)
(-23.5, -25, -0.5009948279948179, 1737)
(-23.399999618530273, -25, -0.4995986820225158, 1732)
(-23.299999237060547, -25, -0.49833906199795897, 1764)
(-23.200000762939453, -25, -0.5013796942594327, 1728)
(-23.100000381469727, -25, -0.5072275248223541, 1700)
(-23, -25, -0.4949060352670081, 1749)
(-22.899999618530273, -25, -0.5026246990689665, 1740)
(-22.799999237060547, -25, -0.493411989775698, 1746)
// ... ~25k lines skipped...
(24.200000762939453, 24.900001525878906, -0.508382879738258, 1746)
(24.299999237060547, 24.900001525878906, -0.5064457874896565, 1740)
(24.400001525878906, 24.900001525878906, -0.4990733400392924, 1756)
(24.5, 24.900001525878906, -0.5063144518978036, 1732)
(24.60000228881836, 24.900001525878906, -0.49988387744959534, 1855)
(24.700000762939453, 24.900001525878906, -0.49970549673984693, 1719)
(24.799999237060547, 24.900001525878906, -0.48656442707683384, 1744)
(24.900001525878906, 24.900001525878906, -0.49267272688797675, 1705)

剩余笔记

注意数值错误。您在某些地方使用了 float，但是对于这么大的数据集，您很可能会在运行平均值计算中出现明显的大数值错误。考虑切换到 [long] double 或使用“专业”累加器（许多现有的相关框架或 Boost Accumulator 会做得更好）。

完整代码

Live On Compiler Explorer

#include <algorithm>
#include <chrono>
#include <cmath>
#include <iostream>
#include <tuple>
#include <vector>
#include <fmt/ranges.h>

// boost library
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/iostreams/stream.hpp>

struct point { double x, y, z; };

/*
    Compute Slab
*/
using Float = float; //

Float slabBox[6] = {-25.,25.,-25.,25.,-1.,0.};
Float dx = 0.1;
Float dy = 0.1;
int slabSizeX = (slabBox[1] - slabBox[0]) / dx;
int slabSizeY = (slabBox[3] - slabBox[2]) / dy;

using Record = std::tuple<double, double, double, int>;

std::vector<Record> initSlab()
{
    // initialize the slab vector according to the grid size
    std::vector<Record> slabVector(slabSizeX * slabSizeY, {0., 0., 0., 0});

    // fill the vector with {x,y} cells coordinates
    for (int y = 0; y < slabSizeY; y++) {
        for (int x = 0; x < slabSizeX; x++) {
            slabVector[x + y * slabSizeX] = {
                x * dx + slabBox[0],
                y * dy + slabBox[2],
                0.,
                0,
            };
        }
    }
    return slabVector;
}

void addPoint2Slab(point const p, std::vector<Record>& slabVector)
{
    // find the region {x,y} in the slab in which coord {p.x,p.y} is
    int x = (int) floor((p.x - slabBox[0])/dx);
    int y = (int) floor((p.y - slabBox[2])/dy);
    auto& [ix, iy, iz, icount] = slabVector[x + y * slabSizeX];

    iz = (iz * icount + p.z) / (icount + 1);
    icount += 1;
}

/* Parse the file */
#include <boost/spirit/home/x3.hpp>

auto readMappedFile(std::string fname)
{
    auto slabVector = initSlab();

    boost::iostreams::mapped_file_source mmap(fname);

    auto handle = [&](auto& ctx) {
        using boost::fusion::at_c;
        point p{at_c<0>(_attr(ctx)), at_c<1>(_attr(ctx)), at_c<2>(_attr(ctx))};
        //auto intensity = at_c<3>(_attr(ctx));

        if (p.z > slabBox[4] && p.z < slabBox[5])
            addPoint2Slab(p, slabVector);
    };

    namespace x3 = boost::spirit::x3;
    static auto const line_ =
        x3::double_ >> x3::double_ >> x3::double_ >> x3::int_;

    auto first = mmap.data(), last = first + mmap.size();
    try {
        bool ok = x3::phrase_parse( //
            first, last,
            x3::expect[x3::uint_ >> x3::eol] //
                >> line_[handle] % x3::eol   //
                // expect EOF here
                >> *x3::eol >> x3::expect[x3::eoi], //
            x3::blank);

        // ok is true due to the expectation points
        assert(ok);
    } catch (x3::expectation_failure<char const*> const& ef) {
        auto where = ef.where();
        auto till  = std::min(last, where + 32);
        throw std::runtime_error("Expected " + ef.which() + " at #" +
                                 std::to_string(where - mmap.data()) + " '" +
                                 std::string(where, till) + "'...");
    }

    return slabVector;
}

#include <fmt/ranges.h>
#include <fstream>
#include <random>
#include <ranges>
using std::ranges::views::filter;

int main()
{
    std::string const fname = "T032_OSE.pts";
#if 0 || defined(GENERATE)
    using namespace std;
    // generates a ~12Gib file
    ofstream ofs(fname);
    mt19937  prng{random_device{}()};
    uniform_real_distribution<Float> x(-25, 25), y(-25, +25), z(-1, 0);
    uniform_int_distribution<>       n(0, 100);
    auto N = 437288479;
    ofs << N << "\n";
    while (N--)
        ofs << x(prng) << " " << y(prng) << " " << z(prng) << " " << n(prng) << "\n";
#else
    auto sv        = readMappedFile(fname);
    auto has_count = [](Record const& tup) { return get<3>(tup) > 0; };
    fmt::print("slabVector:\n{}\n", fmt::join(sv | filter(has_count), "\n"));
#endif
}