Stream an HTTPS GET download and parse the XML inside a GZ file

Question

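I need to process XML that sits inside a GZ file while it is being streamed over HTTPS, if that is possible. The file is very large when saved: 23 GB.

At the moment I stream the data from HTTPS and save the file to storage. Because the Python script has to be deployed on AWS as a batch job, local storage is not an option, and I would prefer not to use S3 as storage either.

The algorithm should be: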

 while stream GET HTTPS in chunk:
   - get xml chunk from GZ chunk
   - process xml chunk

The XML has, for example, the following structure:

<List>
<Property>
     <id = '123>
     <PhotoProperties>
          <Photo>
              <url = 'https://www.url.com/photo/1.jpg>
          </Photo>
      </PhotoProperties>
</Property>
<Property>...</Property>

I need to extract the data as a list of:

@dataclass
class Picture:
   id: int
   url: str

Answer 1

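Yes, this is possible. The key is that every step supports streaming, and there are libraries for each of them:

urllib.request can stream the content
zlib can decompress a gzip stream incrementally
For the XML parsing, it is important to know that there are two main approaches:
- DOM parsing: useful when the complete XML fits in memory. It makes it easy to navigate and manipulate the XML content.
- SAX parsing: useful when the XML cannot be held in memory, for example because it is too large or because you want to start processing before the complete stream has been read. This is what you need in your case. xml.parsers.expat can be used for this.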
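To illustrate that both zlib and expat accept their input in arbitrary pieces, here is a minimal sketch that runs entirely in memory. The sample XML, the 7-byte piece size, and the `seen` list are assumptions made for the demonstration and are not part of the question.

```python
import gzip
import zlib
import xml.parsers.expat

# Hypothetical in-memory sample standing in for the downloaded .gz body.
xml_bytes = b'<List><Property id="1"/><Property id="2"/></List>'
gz_bytes = gzip.compress(xml_bytes)

seen = []  # element names, in the order expat reports them

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = lambda name, attrs: seen.append(name)

# 32 + MAX_WBITS lets zlib auto-detect a gzip (or zlib) header.
decompressor = zlib.decompressobj(32 + zlib.MAX_WBITS)

# Feed the compressed data 7 bytes at a time to mimic network chunks.
for i in range(0, len(gz_bytes), 7):
    piece = decompressor.decompress(gz_bytes[i:i + 7])
    if piece:
        parser.Parse(piece, False)

# Hand over anything still buffered and mark the document as complete.
parser.Parse(decompressor.flush(), True)

print(seen)
```

Neither the decompressor nor the parser needs its input split at element boundaries, so the chunk size can be chosen purely for I/O efficiency.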

Based on your example I created a (well-formed) XML snippet:

<?xml version="1.0" encoding="UTF-8"?>
<List>
    <Property id = "123">
        <PhotoProperties>
            <Photo url = "https://www.url.com/photo/1.jpg"/>
        </PhotoProperties>
    </Property>
    <Property id = "456">
        <PhotoProperties>
            <Photo url = "https://www.url.com/photo/2.jpg"/>
        </PhotoProperties>
    </Property>
</List>

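Because the complete XML is never loaded into memory, parsing it is a bit more involved. You need to create handlers that are called, for example, when an XML element is opened or closed. In the example below I put these handlers in a class, which keeps its state in a Picture object and prints it when the closing tag is found: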

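import urllib.request
import zlib
import xml.parsers.expat
from dataclasses import dataclass

URL = 'https://some.url.com/pictures.xml'
CHUNK_SIZE = 64 * 1024  # bytes of compressed data to read per iteration

@dataclass
class Picture:
    id: int
    url: str

class ParseHandler:
    """Holds the Picture being built while expat walks the document."""

    def __init__(self):
        self.current_picture = None

    def start_element(self, name, attrs):
        if name == 'Property':
            # Attribute values arrive as strings; Picture.id is declared as int.
            self.current_picture = Picture(int(attrs['id']), None)
        elif name == 'Photo' and self.current_picture is not None:
            self.current_picture.url = attrs['url']

    def end_element(self, name):
        if name == 'Property':
            print(self.current_picture)
            self.current_picture = None

handler = ParseHandler()

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element

# 32 + MAX_WBITS lets zlib auto-detect a gzip (or zlib) header.
decompressor = zlib.decompressobj(32 + zlib.MAX_WBITS)

with urllib.request.urlopen(URL) as stream:
    # Read fixed-size blocks; iterating over the response would split the
    # binary data at newline bytes and give unpredictable chunk sizes.
    for gzchunk in iter(lambda: stream.read(CHUNK_SIZE), b''):
        xmlchunk = decompressor.decompress(gzchunk)
        parser.Parse(xmlchunk, False)

# Hand over anything still buffered and tell expat the document is complete.
parser.Parse(decompressor.flush(), True)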
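Since the question asks for the data as a list of Picture objects, the same idea can be wrapped in a generator. This is only a sketch under assumptions: the function name `iter_pictures`, the default chunk size, and the choice to emit one Picture per Photo element are mine, and the sample data is compressed in memory so that the example runs without a network connection.

```python
import gzip
import io
import xml.parsers.expat
import zlib
from dataclasses import dataclass
from typing import BinaryIO, Iterator, List


@dataclass
class Picture:
    id: int
    url: str


def iter_pictures(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[Picture]:
    """Yield one Picture per <Photo> while reading a gzip-compressed stream."""
    finished: List[Picture] = []  # pictures completed by the latest Parse() call
    current_id = None             # id of the <Property> currently open, if any

    def start_element(name, attrs):
        nonlocal current_id
        if name == 'Property':
            current_id = int(attrs['id'])
        elif name == 'Photo' and current_id is not None:
            finished.append(Picture(current_id, attrs['url']))

    def end_element(name):
        nonlocal current_id
        if name == 'Property':
            current_id = None

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    decompressor = zlib.decompressobj(32 + zlib.MAX_WBITS)

    while True:
        gzchunk = stream.read(chunk_size)
        if not gzchunk:
            break
        parser.Parse(decompressor.decompress(gzchunk), False)
        # Hand out what this chunk completed, then drop it from memory.
        yield from finished
        finished.clear()

    parser.Parse(decompressor.flush(), True)
    yield from finished


# Hypothetical sample mirroring the snippet above, compressed in memory.
sample = (
    b'<List>'
    b'<Property id="123"><PhotoProperties>'
    b'<Photo url="https://www.url.com/photo/1.jpg"/>'
    b'</PhotoProperties></Property>'
    b'<Property id="456"><PhotoProperties>'
    b'<Photo url="https://www.url.com/photo/2.jpg"/>'
    b'</PhotoProperties></Property>'
    b'</List>'
)
pictures = list(iter_pictures(io.BytesIO(gzip.compress(sample)), chunk_size=16))
print(pictures)
```

The response object returned by urllib.request.urlopen also has a read(n) method, so it should be usable in place of the io.BytesIO object. For a 23 GB file it is probably better to loop over the generator and process each Picture as it arrives than to collect everything with list().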
