How to decode XML using goroutines

Question

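I'm working on a proof of concept to investigate how long it takes to parse an XML document with a given number of entities.

First, I have my struct that holds the entries from the XML document: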

type Node struct {
    ID             int    `xml:"id,attr"`
    Position       int    `xml:"position,attr"`
    Depth          int    `xml:"depth,attr"`
    Parent         string `xml:"parent,attr"`
    Name           string `xml:"Name"`
    Description    string `xml:"Description"`
    OwnInformation struct {
        Title       string `xml:"Title"`
        Description string `xml:"Description"`
    } `xml:"OwnInformation"`
    Assets []struct {
        ID           string `xml:"id,attr"`
        Position     int    `xml:"position,attr"`
        Type         string `xml:"type,attr"`
        Category     int    `xml:"category,attr"`
        OriginalFile string `xml:"OriginalFile"`
        Description  string `xml:"Description"`
        URI          string `xml:"Uri"`
    } `xml:"Assets>Asset"`
    Synonyms []string `xml:"Synonyms>Synonym"`
}

Next, I have a factory that can generate any given number of elements:

func CreateNodeXMLDocumentBytes(
    nodeElementCount int) []byte {

    xmlContents := new(bytes.Buffer)

    xmlContents.WriteString("<ROOT>\n")

    for iterationCounter := 0; iterationCounter < nodeElementCount; iterationCounter++ {
        appendNodeXMLElement(iterationCounter, xmlContents)
    }

    xmlContents.WriteString("</ROOT>")

    return xmlContents.Bytes()
}

// PRIVATE: appendNodeXMLElement appends one '<Node />' element to an existing bytes.Buffer instance.
func appendNodeXMLElement(
    counter int,
    xmlDocument *bytes.Buffer) {

    xmlDocument.WriteString("<Node id=\"" + strconv.Itoa(counter) + "\" position=\"0\" depth=\"0\" parent=\"0\">\n")
    xmlDocument.WriteString("    <Name>Name</Name>\n")
    xmlDocument.WriteString("    <Description>Description</Description>\n")
    xmlDocument.WriteString("    <OwnInformation>\n")
    xmlDocument.WriteString("        <Title>Title</Title>\n")
    xmlDocument.WriteString("        <Description>Description</Description>\n")
    xmlDocument.WriteString("    </OwnInformation>\n")
    xmlDocument.WriteString("    <Assets>\n")
    xmlDocument.WriteString("        <Asset id=\"0\" position=\"0\" type=\"0\" category=\"0\">\n")
    xmlDocument.WriteString("            <OriginalFile>OriginalFile</OriginalFile>\n")
    xmlDocument.WriteString("            <Description>Description</Description>\n")
    xmlDocument.WriteString("            <Uri>Uri</Uri>\n")
    xmlDocument.WriteString("        </Asset>\n")
    xmlDocument.WriteString("        <Asset id=\"1\" position=\"1\" type=\"1\" category=\"1\">\n")
    xmlDocument.WriteString("            <OriginalFile>OriginalFile</OriginalFile>\n")
    xmlDocument.WriteString("            <Description>Description</Description>\n")
    xmlDocument.WriteString("            <Uri>Uri</Uri>\n")
    xmlDocument.WriteString("        </Asset>\n")
    xmlDocument.WriteString("        <Asset id=\"2\" position=\"2\" type=\"2\" category=\"2\">\n")
    xmlDocument.WriteString("            <OriginalFile>OriginalFile</OriginalFile>\n")
    xmlDocument.WriteString("            <Description>Description</Description>\n")
    xmlDocument.WriteString("            <Uri>Uri</Uri>\n")
    xmlDocument.WriteString("        </Asset>\n")
    xmlDocument.WriteString("        <Asset id=\"3\" position=\"3\" type=\"3\" category=\"3\">\n")
    xmlDocument.WriteString("            <OriginalFile>OriginalFile</OriginalFile>\n")
    xmlDocument.WriteString("            <Description>Description</Description>\n")
    xmlDocument.WriteString("            <Uri>Uri</Uri>\n")
    xmlDocument.WriteString("        </Asset>\n")
    xmlDocument.WriteString("        <Asset id=\"4\" position=\"4\" type=\"4\" category=\"4\">\n")
    xmlDocument.WriteString("            <OriginalFile>OriginalFile</OriginalFile>\n")
    xmlDocument.WriteString("            <Description>Description</Description>\n")
    xmlDocument.WriteString("            <Uri>Uri</Uri>\n")
    xmlDocument.WriteString("        </Asset>\n")
    xmlDocument.WriteString("    </Assets>\n")
    xmlDocument.WriteString("    <Synonyms>\n")
    xmlDocument.WriteString("        <Synonym>Synonym 0</Synonym>\n")
    xmlDocument.WriteString("        <Synonym>Synonym 1</Synonym>\n")
    xmlDocument.WriteString("        <Synonym>Synonym 2</Synonym>\n")
    xmlDocument.WriteString("        <Synonym>Synonym 3</Synonym>\n")
    xmlDocument.WriteString("        <Synonym>Synonym 4</Synonym>\n")
    xmlDocument.WriteString("    </Synonyms>\n")
    xmlDocument.WriteString("</Node>\n")
}

Next, I have the application that creates a sample document and decodes each '<Node />' element:

func main() {
    nodeXMLDocumentBytes := factories.CreateNodeXMLDocumentBytes(100)

    xmlDocReader := bytes.NewReader(nodeXMLDocumentBytes)
    xmlDocDecoder := xml.NewDecoder(xmlDocReader)

    xmlDocNodeElementCounter := 0

    start := time.Now()

    for {
        token, _ := xmlDocDecoder.Token()
        if token == nil {
            break
        }

        switch element := token.(type) {
        case xml.StartElement:
            if element.Name.Local == "Node" {
                xmlDocNodeElementCounter++

                xmlDocDecoder.DecodeElement(new(entities.Node), &element)
            }
        }
    }

    fmt.Println("Total '<Node />' elements in the XML document: ", xmlDocNodeElementCounter)
    fmt.Printf("Total elapsed time: %v\n", time.Since(start))
}

This takes around 11ms on my machine.

Next, I use goroutines to decode the XML elements:

func main() {
    nodeXMLDocumentBytes := factories.CreateNodeXMLDocumentBytes(100)

    xmlDocReader := bytes.NewReader(nodeXMLDocumentBytes)
    xmlDocDecoder := xml.NewDecoder(xmlDocReader)

    xmlDocNodeElementCounter := 0

    start := time.Now()

    for {
        token, _ := xmlDocDecoder.Token()
        if token == nil {
            break
        }

        switch element := token.(type) {
        case xml.StartElement:
            if element.Name.Local == "Node" {
                xmlDocNodeElementCounter++

                go xmlDocDecoder.DecodeElement(new(entities.Node), &element)
            }
        }
    }

    time.Sleep(time.Second * 5)

    fmt.Println("Total '<Node />' elements in the XML document: ", xmlDocNodeElementCounter)
    fmt.Printf("Total elapsed time: %v\n", time.Since(start))
}

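I use a simple 'Sleep' call to make sure the goroutines have finished. I know it should be implemented with channels and a worker queue.

According to the output on my console, only 3 elements are decoded. So what happened to the other elements? Does it have something to do with the fact that I'm reading from a stream?

Is there a way to make this concurrent, so that decoding all the elements takes less time?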

Answer 1

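You only have a single xml.Decoder object. Every time something calls xmlDocDecoder.Token(), it reads the next token off the (single) input stream. In your example, the main loop and every goroutine you launch are all trying to read the same input stream at the same time, so the stream of tokens gets split randomly across all of the goroutines. If you ran this again, you would probably get different results; I'm a little surprised that the approach doesn't panic in some bizarre way.

Several things about XML make this hard to parallelize. The sequence you would actually need to implement here is: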

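1. Notice a <Node> start-element event.
2. Read forward until the matching </Node> end-element event at the same depth, remembering every event you pass along the way.
3. Launch a goroutine to unmarshal all of the events you remembered into a struct.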
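The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a recommended design: Node is a trimmed-down stand-in for the question's entities.Node, and tokenReplayer and decodeNodes are hypothetical names introduced here. The sketch relies on xml.CopyToken to keep each remembered token valid after the next read, and on xml.NewTokenDecoder to unmarshal from the remembered tokens. Only the calling goroutine ever touches the shared decoder.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"sync"
)

// Node is a trimmed-down stand-in for the question's entities.Node.
type Node struct {
	ID   int    `xml:"id,attr"`
	Name string `xml:"Name"`
}

// tokenReplayer (hypothetical helper) replays remembered tokens.
// It satisfies the xml.TokenReader interface.
type tokenReplayer struct {
	tokens []xml.Token
	next   int
}

func (r *tokenReplayer) Token() (xml.Token, error) {
	if r.next >= len(r.tokens) {
		return nil, io.EOF
	}
	tok := r.tokens[r.next]
	r.next++
	return tok, nil
}

// decodeNodes (hypothetical name) reads tokens on the calling goroutine only
// and unmarshals each buffered <Node> element on its own goroutine.
func decodeNodes(doc []byte) ([]*Node, error) {
	dec := xml.NewDecoder(bytes.NewReader(doc))
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex // guards firstErr
		firstErr error
		scanErr  error
		nodes    []*Node
	)
scan:
	for {
		tok, err := dec.Token()
		if err != nil {
			if err != io.EOF {
				scanErr = err
			}
			break
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != "Node" {
			continue
		}
		// Steps 1 and 2: remember every token up to the matching </Node>.
		events := []xml.Token{xml.CopyToken(start)}
		for depth := 1; depth > 0; {
			inner, err := dec.Token()
			if err != nil {
				scanErr = err
				break scan
			}
			switch inner.(type) {
			case xml.StartElement:
				depth++
			case xml.EndElement:
				depth--
			}
			events = append(events, xml.CopyToken(inner))
		}
		// Step 3: unmarshal the remembered tokens on a separate goroutine.
		node := new(Node)
		nodes = append(nodes, node) // keeps document order
		wg.Add(1)
		go func() {
			defer wg.Done()
			replay := xml.NewTokenDecoder(&tokenReplayer{tokens: events})
			if err := replay.Decode(node); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}()
	}
	wg.Wait() // replaces the time.Sleep from the question
	if scanErr != nil {
		return nil, scanErr
	}
	if firstErr != nil {
		return nil, firstErr
	}
	return nodes, nil
}

func main() {
	doc := []byte(`<ROOT><Node id="0"><Name>first</Name></Node><Node id="1"><Name>second</Name></Node></ROOT>`)
	nodes, err := decodeNodes(doc)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, n := range nodes {
		fmt.Println(n.ID, n.Name)
	}
}
```

Note that tokenizing still happens on a single goroutine, and every token is copied before it is handed off, so this sketch may well be slower than the plain sequential loop.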

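In practice, the "remember every event" step is probably about as expensive as the unmarshaling itself, and the whole sequence will be much faster than the disk or network I/O needed to read the file in the first place. This doesn't look like something that will parallelize well.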

You wrote: "This takes around 11ms on my machine."

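You aren't doing enough work to really tell whether this is "fast" or "slow". Look at the benchmarking support in the testing package for a better approach, plus the built-in profiling tools. That will tell you where the time is actually going and suggest what you might be able to improve.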
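As a rough illustration of the benchmarking suggestion, here is a minimal sketch. Node, sampleDocument, and decodeSequential are hypothetical stand-ins for the question's entities and factories packages. Normally BenchmarkDecodeSequential would live in a _test.go file and be run with go test -bench=. (adding -cpuprofile cpu.out to produce a profile for go tool pprof). Here main calls testing.Benchmark directly so that the example runs as a single program.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strconv"
	"testing"
)

// Node is a trimmed-down stand-in for the question's entities.Node.
type Node struct {
	ID   int    `xml:"id,attr"`
	Name string `xml:"Name"`
}

// sampleDocument (hypothetical helper) builds a <ROOT> document
// containing count <Node> elements.
func sampleDocument(count int) []byte {
	var buf bytes.Buffer
	buf.WriteString("<ROOT>")
	for i := 0; i < count; i++ {
		buf.WriteString(`<Node id="` + strconv.Itoa(i) + `"><Name>Name</Name></Node>`)
	}
	buf.WriteString("</ROOT>")
	return buf.Bytes()
}

// decodeSequential decodes every <Node> element on one goroutine
// and returns how many elements it decoded.
func decodeSequential(doc []byte) (int, error) {
	dec := xml.NewDecoder(bytes.NewReader(doc))
	count := 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return count, nil
		}
		if err != nil {
			return count, err
		}
		if start, ok := tok.(xml.StartElement); ok && start.Name.Local == "Node" {
			var n Node
			if err := dec.DecodeElement(&n, &start); err != nil {
				return count, err
			}
			count++
		}
	}
}

// BenchmarkDecodeSequential measures only the decoding loop;
// building the sample document is excluded from the timing.
func BenchmarkDecodeSequential(b *testing.B) {
	doc := sampleDocument(100)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := decodeSequential(doc); err != nil {
			b.Fatal(err)
		}
	}
}

func main() {
	// testing.Benchmark chooses b.N automatically and reports ns/op.
	fmt.Println(testing.Benchmark(BenchmarkDecodeSequential))
}
```

Because the framework repeats the loop until the timing is stable, the result is more trustworthy than a single time.Since measurement around one run.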
