Boost::Spirit 解析器:寻找最高性能和最低内存使用率
Boost::Spirit parser: Looking for max performance and min mem usage
在问了几个关于解析复杂日志的问题后,我终于知道了这样做的最佳方法。
现在的问题是是否有一些方法可以提高性能 and/or 减少内存使用,甚至编译时间。我会要求答案满足这些限制:
MS VS 2010(不完全是 c++11,只是实现了一些功能:auto、lambdas...)和 boost 1.53(这里唯一的问题是 string_view
仍然存在不可用,但使用 string_ref
仍然有效,甚至表示将来可能会弃用它)。
日志被压缩,并且使用输出旧原始 C "char" 数组的开放库将它们直接解压缩到 ram 内存,因此不值得使用 std::string
因为内存已经被库分配了。它们有数千个,它们占了好几个 GB,所以将它们保存在内存中不是一种选择。我的意思是,由于在解析日志后删除了日志,因此无法使用 string_view
。
将日期字符串解析为 POSIX 时间可能是个好主意。只有两点:避免为此分配字符串应该很有趣,据我所知,POSIX times 不允许 ms,所以它们应该保存在另一个额外的变量中。
一些字符串在日志中重复出现(道路变量p.e)。使用一些享元模式(它的 boost 实现)来减少内存可能会很有趣,即使记住这会产生性能成本。
使用模板库时编译时间很痛苦。我真的很感激任何有助于减少它们的调整:也许将语法拆分为子语法?也许使用预编译的头文件?
这个的最终用途是查询任何事件,例如获取所有 GEAR 事件(值和时间),并在固定时间间隔内或每次事件发生时记录所有汽车变量。日志中有两种类型的记录:纯 "Location" 记录和 "Location + Event" 记录(我的意思是,每次解析事件时,位置也必须保存)。将它们分成两个向量可以实现快速查询,但会减慢解析速度。仅使用一个公共向量可以实现快速解析,但会减慢查询速度。对此有什么想法吗?也许提升多索引容器会像之前建议的那样有所帮助?
请不要犹豫,提供您认为可能有助于实现目标的任何解决方案或更改任何内容。
//#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <cstring> // strlen
typedef char const* It;
namespace MyEvents {
enum Kind { LOCATION, SLOPE, GEAR, DIR };
struct Event {
Kind kind;
double value;
};
struct LogRecord {
int driver;
double time;
double vel;
double km;
std::string date;
std::string road;
Event event;
};
typedef std::vector<LogRecord> LogRecords;
}
BOOST_FUSION_ADAPT_STRUCT(MyEvents::Event,
(MyEvents::Kind, kind)
(double, value))
BOOST_FUSION_ADAPT_STRUCT(MyEvents::LogRecord,
(std::string, date)
(double, time)
(int, driver)
(double, vel)
(std::string, road)
(double, km)
(MyEvents::Event, event))
namespace qi = boost::spirit::qi;
namespace QiParsers {
template <typename It>
struct LogParser : qi::grammar<It, MyEvents::LogRecords()> {
LogParser() : LogParser::base_type(start) {
using namespace qi;
kind.add
("SLOPE", MyEvents::SLOPE)
("GEAR", MyEvents::GEAR)
("DIR", MyEvents::DIR);
values.add("G1", 1.0)
("G2", 2.0)
("REVERSE", -1.0)
("NORTH", 1.0)
("EAST", 2.0)
("WEST", 3.0)
("SOUTH", 4.0);
MyEvents::Event null_event = {MyEvents::LOCATION, 0.0};
line_record
= '[' >> raw[repeat(4)[digit] >> '-' >> repeat(3)[alpha] >> '-' >> repeat(2)[digit] >> ' ' >>
repeat(2)[digit] >> ':' >> repeat(2)[digit] >> ':' >> repeat(2)[digit] >> '.' >> repeat(6)[digit]] >> "]"
>> " - " >> double_ >> " s"
>> " => Driver: " >> int_
>> " - Speed: " >> double_
>> " - Road: " >> raw[+graph]
>> " - Km: " >> double_
>> (" - " >> kind >> ": " >> (double_ | values) | attr(null_event));
start = line_record % eol;
//BOOST_SPIRIT_DEBUG_NODES((start)(line_record))
}
private:
qi::rule<It, MyEvents::LogRecords()> start;
qi::rule<It, MyEvents::LogRecord()> line_record;
qi::symbols<char, MyEvents::Kind> kind;
qi::symbols<char, double> values;
};
}
MyEvents::LogRecords parse_spirit(It b, It e) {
static QiParsers::LogParser<It> const parser;
MyEvents::LogRecords records;
parse(b, e, parser, records);
return records;
}
static char input[] =
"[2018-Mar-13 13:13:59.580482] - 0.200 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - SLOPE: 5.5\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - GEAR: G1\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - DIR: NORTH\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.1 - Road: A-11 - Km: 90.0\n\
[2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.1 - GEAR: G2\n\
[2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2\n\
[2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2 - DIR: EAST\n\
[2018-Mar-13 13:15:01.819966] - 3.440 s => Driver: 0 - Speed: 0.2 - Road: B-16 - Km: 90.3 - SLOPE: -10\n\
[2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: B-16 - Km: 90.4 - GEAR: REVERSE\n";
static const size_t len = strlen(input);
namespace MyEvents { // for debug/demo
using boost::fusion::operator<<;
static inline std::ostream& operator<<(std::ostream& os, Kind k) {
switch(k) {
case LOCATION: return os << "LOCATION";
case SLOPE: return os << "SLOPE";
case GEAR: return os << "GEAR";
case DIR: return os << "DIR";
}
return os;
}
}
int main() {
MyEvents::LogRecords records = parse_spirit(input, input+len);
std::cout << "Parsed: " << records.size() << " records\n";
for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
std::cout << *it << "\n";
return 0;
}
是 string_ref
本质上是相同的,但在某些点上使用与 std::string_view
略有不同的界面
修订版 1:POSIX 时间
存储POSIX时间原来真的很简单:
#include <boost/date_time/posix_time/posix_time_io.hpp>
接下来,替换类型:
typedef boost::posix_time::ptime Timestamp;
struct LogRecord {
int driver;
double time;
double vel;
double km;
Timestamp date; // << HERE using Timestamp now
std::string road;
Event event;
};
并将解析器简化为:
'[' >> stream >> ']'
Parsed: 9 records
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))
修订 #2:压缩文件
您也可以使用 IOStreams 透明地解压缩输入:
int main(int argc, char **argv) {
MyEvents::LogRecords records;
for (char** arg = argv+1; *arg && (argv+argc != arg); ++arg) {
bool ok = parse_logfile(*arg, records);
std::cout
<< "Parsing " << *arg << (ok?" - success" : " - errors")
<< " (" << records.size() << " records total)\n";
}
for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
std::cout << *it << "\n";
}
parse_logfile
那么可以实现为:
template <typename It>
bool parse_spirit(It b, It e, MyEvents::LogRecords& into) {
static QiParsers::LogParser<It> const parser;
return parse(b, e, parser, into);
}
bool parse_logfile(char const* fname, MyEvents::LogRecords& into) {
boost::iostreams::filtering_istream is;
is.push(boost::iostreams::gzip_decompressor());
std::ifstream ifs(fname, std::ios::binary);
is.push(ifs);
boost::spirit::istream_iterator f(is >> std::noskipws), l;
return parse_spirit(f, l, into);
}
Note: The library has zlib, gzip and bzip2 decompressors. I opted for gzip for demonstration
Parsing input.gz - success (9 records total)
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))
修订版 3:字符串实习
"Interned" 字符串或 "Atoms" 是减少字符串分配的常用方法。您可以使用 Boost Flyweight,但根据我的经验,正确使用它有点复杂。那么,为什么不创建自己的抽象:
struct StringTable {
typedef boost::string_ref Atom;
typedef boost::container::flat_set<Atom> Index;
typedef std::deque<char> Store;
/* An insert in the middle of the deque invalidates all the iterators and
* references to elements of the deque. An insert at either end of the
* deque invalidates all the iterators to the deque, but has no effect on
* the validity of references to elements of the deque.
*/
Store backing;
Index index;
Atom intern(boost::string_ref const& key) {
Index::const_iterator it = index.find(key);
if (it == index.end()) {
Store::const_iterator match = std::search(
backing.begin(), backing.end(),
key.begin(), key.end());
if (match == backing.end()) {
size_t offset = backing.size();
backing.insert(backing.end(), key.begin(), key.end());
match = backing.begin() + offset;
}
it = index.insert(Atom(&*match, key.size())).first;
}
// return the Atom from backing store
return *it;
}
};
现在,我们需要将其集成到解析器中。我建议使用语义操作
Note: Traits could still help here, but they're static, and that would require the StringTable
to be global, which is a choice I'd never make... unless absolutely obligated
首先,更改 Ast:
struct LogRecord {
int driver;
double time;
double vel;
double km;
Timestamp date;
Atom road; // << HERE using Atom now
Event event;
};
接下来,让我们创建一个规则来创建这样一个原子:
qi::rule<It, MyEvents::Atom()> atom;
atom = raw[+graph][_val = intern_(_1)];
当然,这引出了语义操作如何实现的问题:
struct intern_f {
StringTable& _table;
typedef StringTable::Atom result_type;
explicit intern_f(StringTable& table) : _table(table) {}
StringTable::Atom operator()(boost::iterator_range<It> const& range) const {
return _table.intern(sequential(range));
}
private:
// be more efficient if It is const char*
static boost::string_ref sequential(boost::iterator_range<const char*> const& range) {
return boost::string_ref(range.begin(), range.size());
}
template <typename OtherIt>
static std::string sequential(boost::iterator_range<OtherIt> const& range) {
return std::string(range.begin(), range.end());
}
};
boost::phoenix::function<intern_f> intern_;
语法的构造函数将 intern_
函子连接到传入的 StringTable&
。
完整演示
//#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/date_time/posix_time/posix_time_io.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/container/flat_set.hpp>
#include <fstream>
#include <cstring> // strlen
struct StringTable {
typedef boost::string_ref Atom;
typedef boost::container::flat_set<Atom> Index;
typedef std::deque<char> Store;
/* An insert in the middle of the deque invalidates all the iterators and
* references to elements of the deque. An insert at either end of the
* deque invalidates all the iterators to the deque, but has no effect on
* the validity of references to elements of the deque.
*/
Store backing;
Index index;
Atom intern(boost::string_ref const& key) {
Index::const_iterator it = index.find(key);
if (it == index.end()) {
Store::const_iterator match = std::search(
backing.begin(), backing.end(),
key.begin(), key.end());
if (match == backing.end()) {
size_t offset = backing.size();
backing.insert(backing.end(), key.begin(), key.end());
match = backing.begin() + offset;
}
it = index.insert(Atom(&*match, key.size())).first;
}
// return the Atom from backing store
return *it;
}
};
namespace MyEvents {
enum Kind { LOCATION, SLOPE, GEAR, DIR };
struct Event {
Kind kind;
double value;
};
typedef boost::posix_time::ptime Timestamp;
typedef StringTable::Atom Atom;
struct LogRecord {
int driver;
double time;
double vel;
double km;
Timestamp date;
Atom road;
Event event;
};
typedef std::vector<LogRecord> LogRecords;
}
BOOST_FUSION_ADAPT_STRUCT(MyEvents::Event,
(MyEvents::Kind, kind)
(double, value))
BOOST_FUSION_ADAPT_STRUCT(MyEvents::LogRecord,
(MyEvents::Timestamp, date)
(double, time)
(int, driver)
(double, vel)
(MyEvents::Atom, road)
(double, km)
(MyEvents::Event, event))
namespace qi = boost::spirit::qi;
namespace QiParsers {
template <typename It>
struct LogParser : qi::grammar<It, MyEvents::LogRecords()> {
LogParser(StringTable& strings) : LogParser::base_type(start), intern_(intern_f(strings)) {
using namespace qi;
kind.add
("SLOPE", MyEvents::SLOPE)
("GEAR", MyEvents::GEAR)
("DIR", MyEvents::DIR);
values.add("G1", 1.0)
("G2", 2.0)
("REVERSE", -1.0)
("NORTH", 1.0)
("EAST", 2.0)
("WEST", 3.0)
("SOUTH", 4.0);
MyEvents::Event null_event = {MyEvents::LOCATION, 0.0};
atom = raw[+graph][_val = intern_(_1)];
line_record
= '[' >> stream >> ']'
>> " - " >> double_ >> " s"
>> " => Driver: " >> int_
>> " - Speed: " >> double_
>> " - Road: " >> atom
>> " - Km: " >> double_
>> (" - " >> kind >> ": " >> (double_ | values) | attr(null_event));
start = line_record % eol;
BOOST_SPIRIT_DEBUG_NODES((start)(line_record)(atom))
}
private:
struct intern_f {
StringTable& _table;
typedef StringTable::Atom result_type;
explicit intern_f(StringTable& table) : _table(table) {}
StringTable::Atom operator()(boost::iterator_range<It> const& range) const {
return _table.intern(sequential(range));
}
private:
// be more efficient if It is const char*
static boost::string_ref sequential(boost::iterator_range<const char*> const& range) {
return boost::string_ref(range.begin(), range.size());
}
template <typename OtherIt>
static std::string sequential(boost::iterator_range<OtherIt> const& range) {
return std::string(range.begin(), range.end());
}
};
boost::phoenix::function<intern_f> intern_;
qi::rule<It, MyEvents::LogRecords()> start;
qi::rule<It, MyEvents::LogRecord()> line_record;
qi::rule<It, MyEvents::Atom()> atom;
qi::symbols<char, MyEvents::Kind> kind;
qi::symbols<char, double> values;
};
}
template <typename It>
bool parse_spirit(It b, It e, MyEvents::LogRecords& into, StringTable& strings) {
QiParsers::LogParser<It> parser(strings); // TODO optimize by not reconstructing all parser rules each time
return parse(b, e, parser, into);
}
bool parse_logfile(char const* fname, MyEvents::LogRecords& into, StringTable& strings) {
boost::iostreams::filtering_istream is;
is.push(boost::iostreams::gzip_decompressor());
std::ifstream ifs(fname, std::ios::binary);
is.push(ifs);
boost::spirit::istream_iterator f(is >> std::noskipws), l;
return parse_spirit(f, l, into, strings);
}
namespace MyEvents { // for debug/demo
using boost::fusion::operator<<;
static inline std::ostream& operator<<(std::ostream& os, Kind k) {
switch(k) {
case LOCATION: return os << "LOCATION";
case SLOPE: return os << "SLOPE";
case GEAR: return os << "GEAR";
case DIR: return os << "DIR";
}
return os;
}
}
int main(int argc, char **argv) {
StringTable strings;
MyEvents::LogRecords records;
for (char** arg = argv+1; *arg && (argv+argc != arg); ++arg) {
bool ok = parse_logfile(*arg, records, strings);
std::cout
<< "Parsing " << *arg << (ok?" - success" : " - errors")
<< " (" << records.size() << " records total)\n";
}
for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
std::cout << *it << "\n";
std::cout << "Interned strings: " << strings.index.size() << "\n";
std::cout << "Table backing: '";
std::copy(strings.backing.begin(), strings.backing.end(), std::ostreambuf_iterator<char>(std::cout));
std::cout << "'\n";
for (StringTable::Index::const_iterator it = strings.index.begin(); it != strings.index.end(); ++it) {
std::cout << " entry - " << *it << "\n";
}
}
当运行2个输入文件时,第二个与第一个略有不同:
zcat input.gz | sed 's/[16] - Km/ - Km/' | gzip > second.gz
它打印
Parsing input.gz - success (9 records total)
Parsing second.gz - success (18 records total)
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-1 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-1 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-1 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-1 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-1 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-1 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-1 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-1 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-1 90.4 (GEAR -1))
有趣的是 interned 字符串统计:
Interned strings: 4
Table backing: 'A-11B-16'
entry - A-1
entry - A-11
entry - B-1
entry - B-16
请注意 B-1
和 A-1
如何作为 A-11
和 B-16
的子字符串进行重复数据删除,因为它们已经被驻留了。预置您的字符串表可能有助于实现最佳 re-use.
各种备注
我没有很多减少编译时间的技巧。我只是将所有 Spirit 内容放在一个单独的 TU 中,并接受该 TU 的编译时间。毕竟,这是关于用编译时间换取运行时性能。
关于字符串驻留,你最好使用flat_set<char const*>
,这样你就可以根据需要构造一个特定长度的原子。
如果所有字符串都很小,使用 small-string 优化可能(远)更好。
我会让你做比较基准测试,你可能想继续使用你自己的解压缩 + const char* 迭代器。这主要是为了表明 Boost 有它,你不需要 "read the whole file at once".
事实上,在那个主题上,您可能希望将结果存储在内存映射文件中,这样即使您超出了物理内存限制,您也可以愉快地工作。
多索引和查询
您可以在我之前的回答中找到关于此的具体示例:
特别注意通过引用获取索引的方式:
Indexing::Table idx(events.begin(), events.end());
这也可用于将 result-set 存储在另一个(索引)容器中以供 repeated/further 处理。
在问了几个关于解析复杂日志的问题后,我终于知道了这样做的最佳方法。
现在的问题是是否有一些方法可以提高性能 and/or 减少内存使用,甚至编译时间。我会要求答案满足这些限制:
MS VS 2010(不完全是 c++11,只是实现了一些功能:auto、lambdas...)和 boost 1.53(这里唯一的问题是
string_view
仍然存在不可用,但使用string_ref
仍然有效,甚至表示将来可能会弃用它)。日志被压缩,并且使用输出旧原始 C "char" 数组的开放库将它们直接解压缩到 ram 内存,因此不值得使用
std::string
因为内存已经被库分配了。它们有数千个,它们占了好几个 GB,所以将它们保存在内存中不是一种选择。我的意思是,由于在解析日志后删除了日志,因此无法使用string_view
。将日期字符串解析为 POSIX 时间可能是个好主意。只有两点:避免为此分配字符串应该很有趣,据我所知,POSIX times 不允许 ms,所以它们应该保存在另一个额外的变量中。
一些字符串在日志中重复出现(道路变量p.e)。使用一些享元模式(它的 boost 实现)来减少内存可能会很有趣,即使记住这会产生性能成本。
使用模板库时编译时间很痛苦。我真的很感激任何有助于减少它们的调整:也许将语法拆分为子语法?也许使用预编译的头文件?
这个的最终用途是查询任何事件,例如获取所有 GEAR 事件(值和时间),并在固定时间间隔内或每次事件发生时记录所有汽车变量。日志中有两种类型的记录:纯 "Location" 记录和 "Location + Event" 记录(我的意思是,每次解析事件时,位置也必须保存)。将它们分成两个向量可以实现快速查询,但会减慢解析速度。仅使用一个公共向量可以实现快速解析,但会减慢查询速度。对此有什么想法吗?也许提升多索引容器会像之前建议的那样有所帮助?
请不要犹豫,提供您认为可能有助于实现目标的任何解决方案或更改任何内容。
//#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <cstring> // strlen
typedef char const* It;
namespace MyEvents {
enum Kind { LOCATION, SLOPE, GEAR, DIR };
struct Event {
Kind kind;
double value;
};
struct LogRecord {
int driver;
double time;
double vel;
double km;
std::string date;
std::string road;
Event event;
};
typedef std::vector<LogRecord> LogRecords;
}
BOOST_FUSION_ADAPT_STRUCT(MyEvents::Event,
(MyEvents::Kind, kind)
(double, value))
BOOST_FUSION_ADAPT_STRUCT(MyEvents::LogRecord,
(std::string, date)
(double, time)
(int, driver)
(double, vel)
(std::string, road)
(double, km)
(MyEvents::Event, event))
namespace qi = boost::spirit::qi;
namespace QiParsers {
template <typename It>
struct LogParser : qi::grammar<It, MyEvents::LogRecords()> {
LogParser() : LogParser::base_type(start) {
using namespace qi;
kind.add
("SLOPE", MyEvents::SLOPE)
("GEAR", MyEvents::GEAR)
("DIR", MyEvents::DIR);
values.add("G1", 1.0)
("G2", 2.0)
("REVERSE", -1.0)
("NORTH", 1.0)
("EAST", 2.0)
("WEST", 3.0)
("SOUTH", 4.0);
MyEvents::Event null_event = {MyEvents::LOCATION, 0.0};
line_record
= '[' >> raw[repeat(4)[digit] >> '-' >> repeat(3)[alpha] >> '-' >> repeat(2)[digit] >> ' ' >>
repeat(2)[digit] >> ':' >> repeat(2)[digit] >> ':' >> repeat(2)[digit] >> '.' >> repeat(6)[digit]] >> "]"
>> " - " >> double_ >> " s"
>> " => Driver: " >> int_
>> " - Speed: " >> double_
>> " - Road: " >> raw[+graph]
>> " - Km: " >> double_
>> (" - " >> kind >> ": " >> (double_ | values) | attr(null_event));
start = line_record % eol;
//BOOST_SPIRIT_DEBUG_NODES((start)(line_record))
}
private:
qi::rule<It, MyEvents::LogRecords()> start;
qi::rule<It, MyEvents::LogRecord()> line_record;
qi::symbols<char, MyEvents::Kind> kind;
qi::symbols<char, double> values;
};
}
MyEvents::LogRecords parse_spirit(It b, It e) {
static QiParsers::LogParser<It> const parser;
MyEvents::LogRecords records;
parse(b, e, parser, records);
return records;
}
static char input[] =
"[2018-Mar-13 13:13:59.580482] - 0.200 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - SLOPE: 5.5\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - GEAR: G1\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - DIR: NORTH\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.1 - Road: A-11 - Km: 90.0\n\
[2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.1 - GEAR: G2\n\
[2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2\n\
[2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2 - DIR: EAST\n\
[2018-Mar-13 13:15:01.819966] - 3.440 s => Driver: 0 - Speed: 0.2 - Road: B-16 - Km: 90.3 - SLOPE: -10\n\
[2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: B-16 - Km: 90.4 - GEAR: REVERSE\n";
static const size_t len = strlen(input);
namespace MyEvents { // for debug/demo
using boost::fusion::operator<<;
static inline std::ostream& operator<<(std::ostream& os, Kind k) {
switch(k) {
case LOCATION: return os << "LOCATION";
case SLOPE: return os << "SLOPE";
case GEAR: return os << "GEAR";
case DIR: return os << "DIR";
}
return os;
}
}
int main() {
MyEvents::LogRecords records = parse_spirit(input, input+len);
std::cout << "Parsed: " << records.size() << " records\n";
for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
std::cout << *it << "\n";
return 0;
}
是 string_ref
本质上是相同的,但在某些点上使用与 std::string_view
略有不同的界面
修订版 1:POSIX 时间
存储POSIX时间原来真的很简单:
#include <boost/date_time/posix_time/posix_time_io.hpp>
接下来,替换类型:
typedef boost::posix_time::ptime Timestamp;
struct LogRecord {
int driver;
double time;
double vel;
double km;
Timestamp date; // << HERE using Timestamp now
std::string road;
Event event;
};
并将解析器简化为:
'[' >> stream >> ']'
Parsed: 9 records
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))
修订 #2:压缩文件
您也可以使用 IOStreams 透明地解压缩输入:
int main(int argc, char **argv) {
MyEvents::LogRecords records;
for (char** arg = argv+1; *arg && (argv+argc != arg); ++arg) {
bool ok = parse_logfile(*arg, records);
std::cout
<< "Parsing " << *arg << (ok?" - success" : " - errors")
<< " (" << records.size() << " records total)\n";
}
for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
std::cout << *it << "\n";
}
parse_logfile
那么可以实现为:
template <typename It>
bool parse_spirit(It b, It e, MyEvents::LogRecords& into) {
static QiParsers::LogParser<It> const parser;
return parse(b, e, parser, into);
}
bool parse_logfile(char const* fname, MyEvents::LogRecords& into) {
boost::iostreams::filtering_istream is;
is.push(boost::iostreams::gzip_decompressor());
std::ifstream ifs(fname, std::ios::binary);
is.push(ifs);
boost::spirit::istream_iterator f(is >> std::noskipws), l;
return parse_spirit(f, l, into);
}
Note: The library has zlib, gzip and bzip2 decompressors. I opted for gzip for demonstration
Parsing input.gz - success (9 records total)
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))
修订版 3:字符串实习
"Interned" 字符串或 "Atoms" 是减少字符串分配的常用方法。您可以使用 Boost Flyweight,但根据我的经验,正确使用它有点复杂。那么,为什么不创建自己的抽象:
struct StringTable {
typedef boost::string_ref Atom;
typedef boost::container::flat_set<Atom> Index;
typedef std::deque<char> Store;
/* An insert in the middle of the deque invalidates all the iterators and
* references to elements of the deque. An insert at either end of the
* deque invalidates all the iterators to the deque, but has no effect on
* the validity of references to elements of the deque.
*/
Store backing;
Index index;
Atom intern(boost::string_ref const& key) {
Index::const_iterator it = index.find(key);
if (it == index.end()) {
Store::const_iterator match = std::search(
backing.begin(), backing.end(),
key.begin(), key.end());
if (match == backing.end()) {
size_t offset = backing.size();
backing.insert(backing.end(), key.begin(), key.end());
match = backing.begin() + offset;
}
it = index.insert(Atom(&*match, key.size())).first;
}
// return the Atom from backing store
return *it;
}
};
现在,我们需要将其集成到解析器中。我建议使用语义操作
Note: Traits could still help here, but they're static, and that would require the
StringTable
to be global, which is a choice I'd never make... unless absolutely obligated
首先,更改 Ast:
struct LogRecord {
int driver;
double time;
double vel;
double km;
Timestamp date;
Atom road; // << HERE using Atom now
Event event;
};
接下来,让我们创建一个规则来创建这样一个原子:
qi::rule<It, MyEvents::Atom()> atom;
atom = raw[+graph][_val = intern_(_1)];
当然,这引出了语义操作如何实现的问题:
struct intern_f {
StringTable& _table;
typedef StringTable::Atom result_type;
explicit intern_f(StringTable& table) : _table(table) {}
StringTable::Atom operator()(boost::iterator_range<It> const& range) const {
return _table.intern(sequential(range));
}
private:
// be more efficient if It is const char*
static boost::string_ref sequential(boost::iterator_range<const char*> const& range) {
return boost::string_ref(range.begin(), range.size());
}
template <typename OtherIt>
static std::string sequential(boost::iterator_range<OtherIt> const& range) {
return std::string(range.begin(), range.end());
}
};
boost::phoenix::function<intern_f> intern_;
语法的构造函数将 intern_
函子连接到传入的 StringTable&
。
完整演示
//#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/date_time/posix_time/posix_time_io.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/container/flat_set.hpp>
#include <fstream>
#include <cstring> // strlen
struct StringTable {
typedef boost::string_ref Atom;
typedef boost::container::flat_set<Atom> Index;
typedef std::deque<char> Store;
/* An insert in the middle of the deque invalidates all the iterators and
* references to elements of the deque. An insert at either end of the
* deque invalidates all the iterators to the deque, but has no effect on
* the validity of references to elements of the deque.
*/
Store backing;
Index index;
Atom intern(boost::string_ref const& key) {
Index::const_iterator it = index.find(key);
if (it == index.end()) {
Store::const_iterator match = std::search(
backing.begin(), backing.end(),
key.begin(), key.end());
if (match == backing.end()) {
size_t offset = backing.size();
backing.insert(backing.end(), key.begin(), key.end());
match = backing.begin() + offset;
}
it = index.insert(Atom(&*match, key.size())).first;
}
// return the Atom from backing store
return *it;
}
};
namespace MyEvents {
enum Kind { LOCATION, SLOPE, GEAR, DIR };
struct Event {
Kind kind;
double value;
};
typedef boost::posix_time::ptime Timestamp;
typedef StringTable::Atom Atom;
struct LogRecord {
int driver;
double time;
double vel;
double km;
Timestamp date;
Atom road;
Event event;
};
typedef std::vector<LogRecord> LogRecords;
}
BOOST_FUSION_ADAPT_STRUCT(MyEvents::Event,
(MyEvents::Kind, kind)
(double, value))
BOOST_FUSION_ADAPT_STRUCT(MyEvents::LogRecord,
(MyEvents::Timestamp, date)
(double, time)
(int, driver)
(double, vel)
(MyEvents::Atom, road)
(double, km)
(MyEvents::Event, event))
namespace qi = boost::spirit::qi;
namespace QiParsers {
template <typename It>
struct LogParser : qi::grammar<It, MyEvents::LogRecords()> {
LogParser(StringTable& strings) : LogParser::base_type(start), intern_(intern_f(strings)) {
using namespace qi;
kind.add
("SLOPE", MyEvents::SLOPE)
("GEAR", MyEvents::GEAR)
("DIR", MyEvents::DIR);
values.add("G1", 1.0)
("G2", 2.0)
("REVERSE", -1.0)
("NORTH", 1.0)
("EAST", 2.0)
("WEST", 3.0)
("SOUTH", 4.0);
MyEvents::Event null_event = {MyEvents::LOCATION, 0.0};
atom = raw[+graph][_val = intern_(_1)];
line_record
= '[' >> stream >> ']'
>> " - " >> double_ >> " s"
>> " => Driver: " >> int_
>> " - Speed: " >> double_
>> " - Road: " >> atom
>> " - Km: " >> double_
>> (" - " >> kind >> ": " >> (double_ | values) | attr(null_event));
start = line_record % eol;
BOOST_SPIRIT_DEBUG_NODES((start)(line_record)(atom))
}
private:
struct intern_f {
StringTable& _table;
typedef StringTable::Atom result_type;
explicit intern_f(StringTable& table) : _table(table) {}
StringTable::Atom operator()(boost::iterator_range<It> const& range) const {
return _table.intern(sequential(range));
}
private:
// be more efficient if It is const char*
static boost::string_ref sequential(boost::iterator_range<const char*> const& range) {
return boost::string_ref(range.begin(), range.size());
}
template <typename OtherIt>
static std::string sequential(boost::iterator_range<OtherIt> const& range) {
return std::string(range.begin(), range.end());
}
};
boost::phoenix::function<intern_f> intern_;
qi::rule<It, MyEvents::LogRecords()> start;
qi::rule<It, MyEvents::LogRecord()> line_record;
qi::rule<It, MyEvents::Atom()> atom;
qi::symbols<char, MyEvents::Kind> kind;
qi::symbols<char, double> values;
};
}
template <typename It>
bool parse_spirit(It b, It e, MyEvents::LogRecords& into, StringTable& strings) {
QiParsers::LogParser<It> parser(strings); // TODO optimize by not reconstructing all parser rules each time
return parse(b, e, parser, into);
}
bool parse_logfile(char const* fname, MyEvents::LogRecords& into, StringTable& strings) {
boost::iostreams::filtering_istream is;
is.push(boost::iostreams::gzip_decompressor());
std::ifstream ifs(fname, std::ios::binary);
is.push(ifs);
boost::spirit::istream_iterator f(is >> std::noskipws), l;
return parse_spirit(f, l, into, strings);
}
namespace MyEvents { // for debug/demo
using boost::fusion::operator<<;
static inline std::ostream& operator<<(std::ostream& os, Kind k) {
switch(k) {
case LOCATION: return os << "LOCATION";
case SLOPE: return os << "SLOPE";
case GEAR: return os << "GEAR";
case DIR: return os << "DIR";
}
return os;
}
}
int main(int argc, char **argv) {
StringTable strings;
MyEvents::LogRecords records;
for (char** arg = argv+1; *arg && (argv+argc != arg); ++arg) {
bool ok = parse_logfile(*arg, records, strings);
std::cout
<< "Parsing " << *arg << (ok?" - success" : " - errors")
<< " (" << records.size() << " records total)\n";
}
for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
std::cout << *it << "\n";
std::cout << "Interned strings: " << strings.index.size() << "\n";
std::cout << "Table backing: '";
std::copy(strings.backing.begin(), strings.backing.end(), std::ostreambuf_iterator<char>(std::cout));
std::cout << "'\n";
for (StringTable::Index::const_iterator it = strings.index.begin(); it != strings.index.end(); ++it) {
std::cout << " entry - " << *it << "\n";
}
}
当运行2个输入文件时,第二个与第一个略有不同:
zcat input.gz | sed 's/[16] - Km/ - Km/' | gzip > second.gz
它打印
Parsing input.gz - success (9 records total)
Parsing second.gz - success (18 records total)
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-1 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-1 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-1 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-1 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-1 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-1 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-1 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-1 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-1 90.4 (GEAR -1))
有趣的是 interned 字符串统计:
Interned strings: 4
Table backing: 'A-11B-16'
entry - A-1
entry - A-11
entry - B-1
entry - B-16
请注意 B-1
和 A-1
如何作为 A-11
和 B-16
的子字符串进行重复数据删除,因为它们已经被驻留了。预置您的字符串表可能有助于实现最佳 re-use.
各种备注
我没有很多减少编译时间的技巧。我只是将所有 Spirit 内容放在一个单独的 TU 中,并接受该 TU 的编译时间。毕竟,这是关于用编译时间换取运行时性能。
关于字符串驻留,你最好使用flat_set<char const*>
,这样你就可以根据需要构造一个特定长度的原子。
如果所有字符串都很小,使用 small-string 优化可能(远)更好。
我会让你做比较基准测试,你可能想继续使用你自己的解压缩 + const char* 迭代器。这主要是为了表明 Boost 有它,你不需要 "read the whole file at once".
事实上,在那个主题上,您可能希望将结果存储在内存映射文件中,这样即使您超出了物理内存限制,您也可以愉快地工作。
多索引和查询
您可以在我之前的回答中找到关于此的具体示例:
特别注意通过引用获取索引的方式:
Indexing::Table idx(events.begin(), events.end());
这也可用于将 result-set 存储在另一个(索引)容器中以供 repeated/further 处理。