Is there any way to get only a few fields from nested records in a Beam pipeline?
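I am reading an Avro file whose nested schema contains too many fields, for example employeeId, empName, empPersonalInfo.Address.city, and so on. I want to write a ParDo function that extracts only a few fields (employeeId, empPersonalInfo.Address.city) from the pipeline records.
The schema of the Avro file is: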
{
"namespace" : "studentjoin.avro",
"type" : "record",
"name" : "student",
"fields" : [
{"name": "personalInfo",
"type": { "type" : "array", "items": {
"type" : "record",
"name" : "studentinfo",
"fields": [
{"name": "studentId", "type": "int"},
{"name": "studentName", "type": ["string", "null"]},
{"name": "studentAddress", "type": {
"type" : "array", "items" : {
"type": "record", "name" : "addressInfo",
"fields":
[
{"name" : "streetName", "type": ["string", "null"] },
{"name": "city", "type": ["string","null"]}
] }}},
{"name": "studentBranch", "type": ["string", "null"]}
]
} }
}
]
}
Without nested fields, the code below works perfectly:
fields_of_interest = (p | 'Projected' >> beam.Map(
lambda row: {f: row[f] for f in selected_fileld_names}))
The Java SDK has a built-in unnest function that first flattens all nested fields to a single level; it would be helpful if something similar were possible in Python.
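pl = (pl |
      "Extract" >> beam.Map(lambda x: (
          # Keys are Avro field names; record names such as "student" or
          # "studentinfo" do not appear as keys in the decoded dict.
          x["personalInfo"][0]["studentId"],
          # First address record only (contains streetName and city).
          x["personalInfo"][0]["studentAddress"][0])))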
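You can't simply flatten the dictionary, because it contains lists (declared with 'type': 'array'), which means there is more than one way to flatten it. What if there are multiple addresses (with multiple city names)? Should you return the first one, or all of them? The implementation above returns only the first element.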
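If all addresses are wanted rather than just the first, one option is to emit one output row per (student, address) pair. The following is a minimal sketch, assuming each record arrives as a plain Python dict keyed by the Avro field names from the schema in the question; the helper name `iter_student_cities` and the sample record are hypothetical and only for illustration.

```python
def iter_student_cities(record):
    """Yield one {'studentId', 'city'} dict per address in a student record.

    Assumes `record` is a dict shaped like the schema in the question:
    personalInfo is a list of student dicts, each with a studentAddress list.
    """
    for info in record.get("personalInfo") or []:
        for address in info.get("studentAddress") or []:
            # city is a nullable field, so it may legitimately be None.
            yield {"studentId": info["studentId"], "city": address.get("city")}


# Hypothetical sample record, made up to match the schema.
sample = {
    "personalInfo": [
        {
            "studentId": 1,
            "studentName": "A",
            "studentAddress": [
                {"streetName": "S1", "city": "Pune"},
                {"streetName": None, "city": "Delhi"},
            ],
            "studentBranch": "CS",
        }
    ]
}

rows = list(iter_student_cities(sample))
```

Because the helper is a generator that can produce zero or more rows per input, it would be applied with `beam.FlatMap(iter_student_cities)` instead of `beam.Map`. A student with no addresses produces no rows in this sketch; whether that is the desired behavior depends on the use case.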