Is there any way to get a few fields from nested records in a Beam pipeline?

Question

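I am reading an Avro file whose nested schema contains too many fields, for example employeeId, empName, empPersonalInfo.Address.city, and so on. I want to write a ParDo function that takes only a few fields from each pipeline record (employeeId, empPersonalInfo.Address.city).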

The schema of the Avro file is:
{
     "namespace"    : "studentjoin.avro",
     "type"         : "record",
     "name"         : "student",
     "fields"       : [
      {"name": "personalInfo",
       "type": { "type" : "array", "items": { 
           "type" : "record",                                
               "name" : "studentinfo",
           "fields": [
                 {"name": "studentId", "type": "int"},
                 {"name": "studentName",  "type": ["string", "null"]},
                 {"name": "studentAddress", "type": {
                    "type" : "array", "items" : {
                        "type": "record", "name" : "addressInfo", 
                        "fields":
                         [
                            {"name" : "streetName", "type": ["string", "null"] },
                            {"name": "city", "type": ["string","null"]}
                         ] }}},

                 {"name": "studentBranch", "type": ["string", "null"]}
                 ]
        } }
    }

    ]
}

When there are no nested fields, the code below works perfectly:

fields_of_interest = (p | 'Projected' >> beam.Map(
          lambda row: {f: row[f] for f in selected_field_names}))

The Java SDK has a built-in unnest function that first flattens all nested fields to a single level. It would be helpful if something similar were possible in Python.

Answer 1

# Takes only the first personalInfo entry and its first address.
pl = (pl |
      "Extract" >> beam.Map(lambda x:
         (x["student"]["personalInfo"][0]["studentInfo"]["studentId"],
          x["student"]["personalInfo"][0]["studentInfo"]["studentAddress"][0]["addressInfo"])))

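You can't simply flatten the dictionary, because it contains lists (specified by 'type': 'array'), and that means there is more than one way to flatten it. What if there are multiple addresses (with multiple city names)? Should it return the first one, or all of them? The implementation above returns only the first element.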
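If you want all of them rather than the first, one option is to emit one flat row per (student, address) pair. The following is only a minimal sketch of that idea. It assumes each decoded record is a plain dict keyed by the schema's field names (personalInfo, studentId, studentAddress, city), which may differ from the key path used above depending on how your records are read. The helper name iter_student_cities and the sample record are hypothetical.

```python
def iter_student_cities(record):
    """Yield one flat dict per (student, address) pair in a decoded record.

    Assumption: the record is a plain dict keyed by the schema's field
    names, e.g. {"personalInfo": [{"studentId": 1, "studentAddress": [...]}]}.
    """
    # "or []" guards against a missing key or a None value.
    for info in record.get("personalInfo") or []:
        for address in info.get("studentAddress") or []:
            yield {"studentId": info["studentId"], "city": address.get("city")}


# Hypothetical sample record shaped like the schema in the question.
sample = {
    "personalInfo": [
        {
            "studentId": 1,
            "studentName": "A",
            "studentAddress": [
                {"streetName": "s1", "city": "Pune"},
                {"streetName": "s2", "city": "Delhi"},
            ],
            "studentBranch": "CS",
        }
    ]
}

flat_rows = list(iter_student_cities(sample))
print(flat_rows)

# In a pipeline, the same function could be applied with FlatMap so that
# each yielded dict becomes its own element:
#   rows = records | "Project" >> beam.FlatMap(iter_student_cities)
```

Note that with this choice a student who has no addresses produces no output rows at all. If such students must be kept, the function would need to yield a row with city set to None for them.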

python

apache-beam