How does protobuf determine whether a value belongs to an optional field or to another object?
For example, if I define a proto like this:
$cat 30.proto
message hello
{
    required int32 f1=1;
    required int32 f2=2;
    optional int32 f3=3;
}
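I doubt that protobuf can handle a case like this:
(1) I declare 3 objects, none of which has the f3 field.
(2) I write them to the output.
(3) Then, on the reader side, how does the reader know whether these 6 values belong to 3 objects (2 fields each) or to 2 objects (3 fields each)?
In other words, how are "required"/"optional" reflected in the encoded bytes? If they are not reflected in the byte stream, how does protobuf determine where a new object starts? As far as I know, protobuf has no "delimiter" bits.
I ran a quick test of this: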
$cat 30.cpp
#include "30.pb.h"
#include <fstream>
using namespace std;

int main()
{
    fstream f("./log30.data", ios::binary | ios::out);
    hello p1, p2, p3, p4, p5;
    p1.set_f1(1);
    p1.set_f2(2);
    p2.set_f1(3);
    p2.set_f2(4);
    p3.set_f1(5);
    p3.set_f2(6);
    p1.SerializeToOstream(&f);
    p2.SerializeToOstream(&f);
    p3.SerializeToOstream(&f);
    p4.set_f1(7);
    p4.set_f2(8);
    p4.set_f3(9);
    p5.set_f1(0xa);
    p5.set_f2(0xb);
    p5.set_f3(0xc);
    p4.SerializeToOstream(&f);
    p5.SerializeToOstream(&f);
    return 0;
}
$g++ 30.cpp 30.pb.cc -lprotobuf && ./a.out && xxd log30.data
00000000: 0801 1002 0803 1004 0805 1006 0807 1008 ................
00000010: 1809 080a 100b 180c ........
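My guess is that the byte stream always starts with the smallest tag number and that tag numbers increase as the stream is written, so when the reader meets a smaller tag number it treats that as the start of a new object. That is only a rough guess.
Please explain!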
(3) Then, on the reader side, how does the reader know whether these 6 values belong to 3 objects (2 fields each) or to 2 objects (3 fields each)?
In other words, how are "required"/"optional" reflected in the encoded bytes? If they are not reflected in the byte stream, how does protobuf determine where a new object starts? As far as I know, protobuf has no "delimiter" bits.
Protobuf doesn't. It is up to you, the programmer, to split the stream into messages before feeding them to protobuf.
For example, run this program:
#include "30.pb.h"
#include <fstream>
#include <iostream>
using namespace std;

int main()
{
    // Write five messages back to back, with nothing separating them.
    fstream f("./log30.data", ios::binary | ios::out);
    hello p1, p2, p3, p4, p5;
    p1.set_f1(1);
    p1.set_f2(2);
    p2.set_f1(3);
    p2.set_f2(4);
    p3.set_f1(5);
    p3.set_f2(6);
    p1.SerializeToOstream(&f);
    p2.SerializeToOstream(&f);
    p3.SerializeToOstream(&f);
    p4.set_f1(7);
    p4.set_f2(8);
    p4.set_f3(9);
    p5.set_f1(0xa);
    p5.set_f2(0xb);
    p5.set_f3(0xc);
    p4.SerializeToOstream(&f);
    p5.SerializeToOstream(&f);
    f.close();

    // Read the whole file back as if it held a single message.
    f.open("./log30.data", ios::binary | ios::in);
    hello hin;
    hin.ParseFromIstream(&f);
    cout << "f1: " << hin.f1() << ", f2: " << hin.f2() << ", f3: " << hin.f3() << "\n";
    return 0;
}
You should see only the values of the last serialized hello object, because protobuf reads the whole stream and overwrites older values with newer ones.
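One common way to do that splitting yourself is to write each message's byte length in front of it. The following is only a minimal sketch in plain C++ with no protobuf dependency. The helper names AppendVarint, AppendFrame and SplitFrames are made up for this illustration, and the payload strings stand in for whatever a message's SerializeAsString() would return.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append `value` as a base-128 varint, the same encoding protobuf uses
// for integers on the wire.
void AppendVarint(std::string& out, std::uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<char>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<char>(value));
}

// Frame one already-serialized message: varint length first, then the bytes.
void AppendFrame(std::string& stream, const std::string& payload) {
    AppendVarint(stream, payload.size());
    stream += payload;
}

// Split a framed stream back into payloads.
// Returns false if a length prefix or a payload is truncated.
bool SplitFrames(const std::string& stream, std::vector<std::string>& frames) {
    std::size_t pos = 0;
    while (pos < stream.size()) {
        std::uint64_t len = 0;
        bool complete = false;
        for (int shift = 0; shift < 64 && pos < stream.size(); shift += 7) {
            const auto byte = static_cast<unsigned char>(stream[pos++]);
            len |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
            if ((byte & 0x80) == 0) {
                complete = true;
                break;
            }
        }
        if (!complete || len > stream.size() - pos) return false;
        frames.push_back(stream.substr(pos, static_cast<std::size_t>(len)));
        pos += static_cast<std::size_t>(len);
    }
    return true;
}
```

With this framing, each recovered payload would be handed to its own hello object (for instance through ParseFromString), so the values of one message no longer overwrite those of the previous one. Recent protobuf C++ releases also ship helpers for this in google/protobuf/util/delimited_message_util.h, so check whether your version has them before rolling your own.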
As you know, a protocol buffer message is a series of key-value pairs. The binary version of a message just uses the field's number as the key – the name and declared type for each field can only be determined on the decoding end by referencing the message type's definition (i.e. the .proto file).
When a message is encoded, the keys and values are concatenated into a byte stream. When the message is being decoded, the parser needs to be able to skip fields that it doesn't recognize. This way, new fields can be added to a message without breaking old programs that do not know about them. To this end, the "key" for each pair in a wire-format message is actually two values – the field number from your .proto file, plus a wire type that provides just enough information to find the length of the following value.
...
If a proto2 message definition has repeated elements (without the [packed=true] option), the encoded message has zero or more key-value pairs with the same tag number.
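The key layout described in the quoted passage can be checked against the hex dump in the question. The sketch below is an illustration only, not the library's parser: it handles nothing but varint fields (wire type 0), and the names ReadVarint and MergeVarintFields are invented here. It splits each key into field number and wire type and keeps the last value seen for each field number.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Read one base-128 varint starting at `pos` and advance `pos` past it.
// Returns false if the input ends in the middle of a varint.
bool ReadVarint(const std::vector<std::uint8_t>& buf, std::size_t& pos,
                std::uint64_t& value) {
    value = 0;
    for (int shift = 0; shift < 64 && pos < buf.size(); shift += 7) {
        const std::uint8_t byte = buf[pos++];
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return true;
    }
    return false;
}

// Walk a buffer of key-value pairs and remember the last value per field.
// Only wire type 0 (varint) is handled; anything else makes it return false.
bool MergeVarintFields(const std::vector<std::uint8_t>& buf,
                       std::map<std::uint32_t, std::uint64_t>& fields) {
    std::size_t pos = 0;
    while (pos < buf.size()) {
        std::uint64_t key = 0;
        std::uint64_t value = 0;
        if (!ReadVarint(buf, pos, key)) return false;
        const std::uint32_t wire_type = static_cast<std::uint32_t>(key & 0x7);
        const std::uint32_t field_number = static_cast<std::uint32_t>(key >> 3);
        if (wire_type != 0) return false;
        if (!ReadVarint(buf, pos, value)) return false;
        fields[field_number] = value;  // a later occurrence replaces an earlier one
    }
    return true;
}
```

In the dump, 0x08, 0x10 and 0x18 are the keys for fields 1, 2 and 3 with wire type 0. Nothing in the bytes marks where one hello ends and the next begins, and the first three messages contain no 0x18 key at all. Running the sketch over all 24 bytes leaves field 1 = 10, field 2 = 11 and field 3 = 12, which are the values of the last object written.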
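So optional elements may simply be left out of the output stream, whereas required ones must be included. The schema must be known for both serialization and deserialization (in contrast to Avro, where the schema must be embedded with the data), so validation of required/optional fields happens after deserialization, when the parser checks whether all required fields have a value.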
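That "decode first, validate afterwards" order can be sketched roughly as follows. This is a simplified illustration rather than what the library does internally: SeenFields and MissingRequired are hypothetical helpers, the list of required field numbers is copied by hand from 30.proto, and the decoder assumes every key and value fits in a single byte, which holds for the small numbers in the question's dump.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Field numbers marked `required` in 30.proto (f1 = 1, f2 = 2).
// f3 = 3 is optional, so it is not listed.
const std::vector<std::uint32_t> kRequiredFields = {1, 2};

// Collect the field numbers present in one message's bytes.
// Simplifying assumption: each key and each value is exactly one byte.
std::set<std::uint32_t> SeenFields(const std::vector<std::uint8_t>& buf) {
    std::set<std::uint32_t> seen;
    for (std::size_t i = 0; i + 1 < buf.size(); i += 2) {
        // Drop the 3 wire-type bits to get the field number.
        seen.insert(static_cast<std::uint32_t>(buf[i] >> 3));
    }
    return seen;
}

// After decoding, list the required field numbers that never appeared.
std::vector<std::uint32_t> MissingRequired(const std::set<std::uint32_t>& seen) {
    std::vector<std::uint32_t> missing;
    for (std::uint32_t number : kRequiredFields) {
        if (seen.count(number) == 0) missing.push_back(number);
    }
    return missing;
}
```

The bytes 08 01 10 02 pass this check even though f3 is absent, while 08 01 18 09 would be reported as missing field 2. In the C++ library, the corresponding check is exposed through IsInitialized(), and the non-partial parse methods report failure when a required field is missing.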