Boost::Spirit 解析器:寻找最高性能和最低内存使用率

Boost::Spirit parser: Looking for max performance and min mem usage

在问了几个关于解析复杂日志的问题后,我终于知道了这样做的最佳方法。

现在的问题是是否有一些方法可以提高性能 and/or 减少内存使用,甚至编译时间。我会要求答案满足这些限制:

  1. MS VS 2010(不完全是 c++11,只是实现了一些功能:auto、lambdas...)和 boost 1.53(这里唯一的问题是 string_view 仍然存在不可用,但使用 string_ref 仍然有效,甚至表示将来可能会弃用它)。

  2. 日志被压缩,并且使用输出旧原始 C "char" 数组的开放库将它们直接解压缩到 ram 内存,因此不值得使用 std::string 因为内存已经被库分配了。它们有数千个,它们占了好几个 GB,所以将它们保存在内存中不是一种选择。我的意思是,由于在解析日志后删除了日志,因此无法使用 string_view

  3. 将日期字符串解析为 POSIX 时间可能是个好主意。只有两点:避免为此分配字符串应该很有趣,据我所知,POSIX times 不允许 ms,所以它们应该保存在另一个额外的变量中。

  4. 一些字符串在日志中重复出现(道路变量p.e)。使用一些享元模式(它的 boost 实现)来减少内存可能会很有趣,即使记住这会产生性能成本。

  5. 使用模板库时编译时间很痛苦。我真的很感激任何有助于减少它们的调整:也许将语法拆分为子语法?也许使用预编译的头文件?

  6. 这个的最终用途是查询任何事件,例如获取所有 GEAR 事件(值和时间),并在固定时间间隔内或每次事件发生时记录所有汽车变量。日志中有两种类型的记录:纯 "Location" 记录和 "Location + Event" 记录(我的意思是,每次解析事件时,位置也必须保存)。将它们分成两个向量可以实现快速查询,但会减慢解析速度。仅使用一个公共向量可以实现快速解析,但会减慢查询速度。对此有什么想法吗?也许提升多索引容器会像之前建议的那样有所帮助?

请不要犹豫,提供您认为可能有助于实现目标的任何解决方案或更改任何内容。

//#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <cstring> // strlen

typedef char const* It;

namespace MyEvents {
    enum Kind { LOCATION, SLOPE, GEAR, DIR };

    struct Event {
        Kind kind;
        double value;
    };

    struct LogRecord {
        int driver;        
        double time;
        double vel;
        double km;
        std::string date;
        std::string road;
        Event event;
    };

    typedef std::vector<LogRecord> LogRecords;
}

BOOST_FUSION_ADAPT_STRUCT(MyEvents::Event,
    (MyEvents::Kind, kind)
    (double, value))


BOOST_FUSION_ADAPT_STRUCT(MyEvents::LogRecord,
        (std::string, date)
        (double, time)
        (int, driver)
        (double, vel)
        (std::string, road)
        (double, km)
        (MyEvents::Event, event))

namespace qi = boost::spirit::qi;

namespace QiParsers {
    template <typename It>
    struct LogParser : qi::grammar<It, MyEvents::LogRecords()> {

        LogParser() : LogParser::base_type(start) {
            using namespace qi;

            kind.add
                ("SLOPE", MyEvents::SLOPE)
                ("GEAR", MyEvents::GEAR)
                ("DIR", MyEvents::DIR);

            values.add("G1", 1.0)
                      ("G2", 2.0)
                      ("REVERSE", -1.0)
                      ("NORTH", 1.0)
                      ("EAST", 2.0)
                      ("WEST", 3.0)
                      ("SOUTH", 4.0);

            MyEvents::Event null_event = {MyEvents::LOCATION, 0.0};

            line_record
                = '[' >> raw[repeat(4)[digit] >> '-' >> repeat(3)[alpha] >> '-' >> repeat(2)[digit] >> ' ' >> 
                             repeat(2)[digit] >> ':' >> repeat(2)[digit] >> ':' >> repeat(2)[digit] >> '.' >> repeat(6)[digit]] >> "]"
                >> " - " >> double_ >> " s"
                >> " => Driver: "  >> int_
                >> " - Speed: "    >> double_
                >> " - Road: "     >> raw[+graph]
                >> " - Km: "       >> double_
                >> (" - " >> kind >> ": " >> (double_ | values) | attr(null_event));

            start = line_record % eol;

            //BOOST_SPIRIT_DEBUG_NODES((start)(line_record))
        }

      private:
        qi::rule<It, MyEvents::LogRecords()> start;

        qi::rule<It, MyEvents::LogRecord()> line_record;

        qi::symbols<char, MyEvents::Kind> kind;
        qi::symbols<char, double> values;
    };
}

MyEvents::LogRecords parse_spirit(It b, It e) {
    static QiParsers::LogParser<It> const parser;

    MyEvents::LogRecords records;
    parse(b, e, parser, records);

    return records;
}

static char input[] = 
"[2018-Mar-13 13:13:59.580482] - 0.200 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - SLOPE: 5.5\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - GEAR: G1\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - DIR: NORTH\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.1 - Road: A-11 - Km: 90.0\n\
[2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.1 - GEAR: G2\n\
[2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2\n\
[2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2 - DIR: EAST\n\
[2018-Mar-13 13:15:01.819966] - 3.440 s => Driver: 0 - Speed: 0.2 - Road: B-16 - Km: 90.3 - SLOPE: -10\n\
[2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: B-16 - Km: 90.4 - GEAR: REVERSE\n";
static const size_t len = strlen(input);

namespace MyEvents { // for debug/demo
    using boost::fusion::operator<<;

    static inline std::ostream& operator<<(std::ostream& os, Kind k) {
        switch(k) {
            case LOCATION: return os << "LOCATION";
            case SLOPE:    return os << "SLOPE";
            case GEAR:     return os << "GEAR";
            case DIR:      return os << "DIR";
        }
        return os;
    }
}

int main() {
    MyEvents::LogRecords records = parse_spirit(input, input+len);
    std::cout << "Parsed: " << records.size() << " records\n";

    for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
        std::cout << *it << "\n"; 

    return 0;
}

string_ref 本质上是相同的,但在某些点上使用与 std::string_view 略有不同的界面

修订版 1:POSIX 时间

存储POSIX时间原来真的很简单:

#include <boost/date_time/posix_time/posix_time_io.hpp>

接下来,替换类型:

typedef boost::posix_time::ptime Timestamp;

struct LogRecord {
    int driver;
    double time;
    double vel;
    double km;
    Timestamp date;    // << HERE using Timestamp now
    std::string road;
    Event event;
};

并将解析器简化为:

'[' >> stream >> ']'

打印 Live On Coliru

Parsed: 9 records
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))

修订 #2:压缩文件

您也可以使用 IOStreams 透明地解压缩输入:

int main(int argc, char **argv) {
    MyEvents::LogRecords records;

    for (char** arg = argv+1; *arg && (argv+argc != arg); ++arg) {
        bool ok = parse_logfile(*arg, records);

        std::cout 
            << "Parsing " << *arg << (ok?" - success" : " - errors")
            << " (" << records.size() << " records total)\n";
    }

    for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
        std::cout << *it << "\n"; 
}

parse_logfile 那么可以实现为:

template <typename It>
bool parse_spirit(It b, It e, MyEvents::LogRecords& into) {
    static QiParsers::LogParser<It> const parser;

    return parse(b, e, parser, into);
}

bool parse_logfile(char const* fname, MyEvents::LogRecords& into) {
    boost::iostreams::filtering_istream is;
    is.push(boost::iostreams::gzip_decompressor());

    std::ifstream ifs(fname, std::ios::binary);
    is.push(ifs);

    boost::spirit::istream_iterator f(is >> std::noskipws), l;
    return parse_spirit(f, l, into);
}

Note: The library has zlib, gzip and bzip2 decompressors. I opted for gzip for demonstration

打印 Live On Coliru

Parsing input.gz - success (9 records total)
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))

修订版 3:字符串实习

"Interned" 字符串或 "Atoms" 是减少字符串分配的常用方法。您可以使用 Boost Flyweight,但根据我的经验,正确使用它有点复杂。那么,为什么不创建自己的抽象:

struct StringTable {
    typedef boost::string_ref Atom;
    typedef boost::container::flat_set<Atom> Index;
    typedef std::deque<char> Store;

    /* An insert in the middle of the deque invalidates all the iterators and
     * references to elements of the deque. An insert at either end of the
     * deque invalidates all the iterators to the deque, but has no effect on
     * the validity of references to elements of the deque.
     */
    Store backing;
    Index index;

    Atom intern(boost::string_ref const& key) {
        Index::const_iterator it = index.find(key);

        if (it == index.end()) {
            Store::const_iterator match = std::search(
                    backing.begin(), backing.end(),
                    key.begin(), key.end());

            if (match == backing.end()) {
                size_t offset = backing.size();
                backing.insert(backing.end(), key.begin(), key.end());
                match = backing.begin() + offset;
            }

            it = index.insert(Atom(&*match, key.size())).first;
        }
        // return the Atom from backing store
        return *it;
    }
};

现在,我们需要将其集成到解析器中。我建议使用语义操作

Note: Traits could still help here, but they're static, and that would require the StringTable to be global, which is a choice I'd never make... unless absolutely obligated

首先,更改 Ast:

struct LogRecord {
    int driver;
    double time;
    double vel;
    double km;
    Timestamp date;
    Atom road;       // << HERE using Atom now
    Event event;
};

接下来,让我们创建一个规则来创建这样一个原子:

qi::rule<It, MyEvents::Atom()> atom;

atom = raw[+graph][_val = intern_(_1)];

当然,这引出了语义操作如何实现的问题:

struct intern_f {
    StringTable& _table;

    typedef StringTable::Atom result_type;
    explicit intern_f(StringTable& table) : _table(table) {}

    StringTable::Atom operator()(boost::iterator_range<It> const& range) const {
        return _table.intern(sequential(range));
    }

  private:
    // be more efficient if It is const char*
    static boost::string_ref sequential(boost::iterator_range<const char*> const& range) {
        return boost::string_ref(range.begin(), range.size());
    }
    template <typename OtherIt>
    static std::string sequential(boost::iterator_range<OtherIt> const& range) {
        return std::string(range.begin(), range.end());
    }
};
boost::phoenix::function<intern_f> intern_;

语法的构造函数将 intern_ 函子连接到传入的 StringTable&

完整演示

Live On Coliru

//#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/date_time/posix_time/posix_time_io.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/container/flat_set.hpp>
#include <fstream>
#include <cstring> // strlen

struct StringTable {
    typedef boost::string_ref Atom;
    typedef boost::container::flat_set<Atom> Index;
    typedef std::deque<char> Store;

    /* An insert in the middle of the deque invalidates all the iterators and
     * references to elements of the deque. An insert at either end of the
     * deque invalidates all the iterators to the deque, but has no effect on
     * the validity of references to elements of the deque.
     */
    Store backing;
    Index index;

    Atom intern(boost::string_ref const& key) {
        Index::const_iterator it = index.find(key);

        if (it == index.end()) {
            Store::const_iterator match = std::search(
                    backing.begin(), backing.end(),
                    key.begin(), key.end());

            if (match == backing.end()) {
                size_t offset = backing.size();
                backing.insert(backing.end(), key.begin(), key.end());
                match = backing.begin() + offset;
            }

            it = index.insert(Atom(&*match, key.size())).first;
        }
        // return the Atom from backing store
        return *it;
    }
};

namespace MyEvents {
    enum Kind { LOCATION, SLOPE, GEAR, DIR };

    struct Event {
        Kind kind;
        double value;
    };

    typedef boost::posix_time::ptime Timestamp;
    typedef StringTable::Atom Atom;

    struct LogRecord {
        int driver;
        double time;
        double vel;
        double km;
        Timestamp date;
        Atom road;
        Event event;
    };

    typedef std::vector<LogRecord> LogRecords;
}

BOOST_FUSION_ADAPT_STRUCT(MyEvents::Event,
        (MyEvents::Kind, kind)
        (double, value))

BOOST_FUSION_ADAPT_STRUCT(MyEvents::LogRecord,
        (MyEvents::Timestamp, date)
        (double, time)
        (int, driver)
        (double, vel)
        (MyEvents::Atom, road)
        (double, km)
        (MyEvents::Event, event))

namespace qi = boost::spirit::qi;

namespace QiParsers {
    template <typename It>
    struct LogParser : qi::grammar<It, MyEvents::LogRecords()> {

        LogParser(StringTable& strings) : LogParser::base_type(start), intern_(intern_f(strings)) {
            using namespace qi;

            kind.add
                ("SLOPE", MyEvents::SLOPE)
                ("GEAR", MyEvents::GEAR)
                ("DIR", MyEvents::DIR);

            values.add("G1", 1.0)
                      ("G2", 2.0)
                      ("REVERSE", -1.0)
                      ("NORTH", 1.0)
                      ("EAST", 2.0)
                      ("WEST", 3.0)
                      ("SOUTH", 4.0);

            MyEvents::Event null_event = {MyEvents::LOCATION, 0.0};

            atom = raw[+graph][_val = intern_(_1)];

            line_record
                = '[' >> stream >> ']'
                >> " - " >> double_ >> " s"
                >> " => Driver: "  >> int_
                >> " - Speed: "    >> double_
                >> " - Road: "     >> atom
                >> " - Km: "       >> double_
                >> (" - " >> kind >> ": " >> (double_ | values) | attr(null_event));

            start = line_record % eol;

            BOOST_SPIRIT_DEBUG_NODES((start)(line_record)(atom))
        }

      private:
        struct intern_f {
            StringTable& _table;

            typedef StringTable::Atom result_type;
            explicit intern_f(StringTable& table) : _table(table) {}

            StringTable::Atom operator()(boost::iterator_range<It> const& range) const {
                return _table.intern(sequential(range));
            }

          private:
            // be more efficient if It is const char*
            static boost::string_ref sequential(boost::iterator_range<const char*> const& range) {
                return boost::string_ref(range.begin(), range.size());
            }
            template <typename OtherIt>
            static std::string sequential(boost::iterator_range<OtherIt> const& range) {
                return std::string(range.begin(), range.end());
            }
        };
        boost::phoenix::function<intern_f> intern_;

        qi::rule<It, MyEvents::LogRecords()> start;

        qi::rule<It, MyEvents::LogRecord()> line_record;
        qi::rule<It, MyEvents::Atom()> atom;

        qi::symbols<char, MyEvents::Kind> kind;
        qi::symbols<char, double> values;
    };
}

template <typename It>
bool parse_spirit(It b, It e, MyEvents::LogRecords& into, StringTable& strings) {
    QiParsers::LogParser<It> parser(strings); // TODO optimize by not reconstructing all parser rules each time

    return parse(b, e, parser, into);
}

bool parse_logfile(char const* fname, MyEvents::LogRecords& into, StringTable& strings) {
    boost::iostreams::filtering_istream is;
    is.push(boost::iostreams::gzip_decompressor());

    std::ifstream ifs(fname, std::ios::binary);
    is.push(ifs);

    boost::spirit::istream_iterator f(is >> std::noskipws), l;
    return parse_spirit(f, l, into, strings);
}

namespace MyEvents { // for debug/demo
    using boost::fusion::operator<<;

    static inline std::ostream& operator<<(std::ostream& os, Kind k) {
        switch(k) {
            case LOCATION: return os << "LOCATION";
            case SLOPE:    return os << "SLOPE";
            case GEAR:     return os << "GEAR";
            case DIR:      return os << "DIR";
        }
        return os;
    }
}

int main(int argc, char **argv) {
    StringTable strings;
    MyEvents::LogRecords records;

    for (char** arg = argv+1; *arg && (argv+argc != arg); ++arg) {
        bool ok = parse_logfile(*arg, records, strings);

        std::cout 
            << "Parsing " << *arg << (ok?" - success" : " - errors")
            << " (" << records.size() << " records total)\n";
    }

    for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
        std::cout << *it << "\n"; 

    std::cout << "Interned strings: " << strings.index.size() << "\n";
    std::cout << "Table backing: '";
    std::copy(strings.backing.begin(), strings.backing.end(), std::ostreambuf_iterator<char>(std::cout));
    std::cout << "'\n";
    for (StringTable::Index::const_iterator it = strings.index.begin(); it != strings.index.end(); ++it) {
        std::cout << " entry - " << *it << "\n";
    }
}

当运行2个输入文件时,第二个与第一个略有不同:

zcat input.gz | sed 's/[16] - Km/ - Km/' | gzip > second.gz

它打印

Parsing input.gz - success (9 records total)
Parsing second.gz - success (18 records total)
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-1 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-1 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-1 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-1 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-1 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-1 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-1 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-1 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-1 90.4 (GEAR -1))

有趣的是 interned 字符串统计:

Interned strings: 4
Table backing: 'A-11B-16'
 entry - A-1
 entry - A-11
 entry - B-1
 entry - B-16

请注意 B-1A-1 如何作为 A-11B-16 的子字符串进行重复数据删除,因为它们已经被驻留了。预置您的字符串表可能有助于实现最佳 re-use.

各种备注

我没有很多减少编译时间的技巧。我只是将所有 Spirit 内容放在一个单独的 TU 中,并接受该 TU 的编译时间。毕竟,这是关于用编译时间换取运行时性能。

关于字符串驻留,你最好使用flat_set<char const*>,这样你就可以根据需要构造一个特定长度的原子。

如果所有字符串都很小,使用 small-string 优化可能(远)更好。

我会让你做比较基准测试,你可能想继续使用你自己的解压缩 + const char* 迭代器。这主要是为了表明 Boost 有它,你不需要 "read the whole file at once".

事实上,在那个主题上,您可能希望将结果存储在内存映射文件中,这样即使您超出了物理内存限制,您也可以愉快地工作。

多索引和查询

您可以在我之前的回答中找到关于此的具体示例:

特别注意通过引用获取索引的方式:

Indexing::Table idx(events.begin(), events.end());

这也可用于将 result-set 存储在另一个(索引)容器中以供 repeated/further 处理。