How do I round trip an entitized carriage return with XDocument?

Suppose I have this XML document:

<x xml:space='preserve'>&#xd;
</x>

with this byte sequence as the content of <x/>:

38 35 120 100 59 13 10

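My understanding of the W3C spec is that the sequence 13 10 will be replaced with 10 before parsing. To get the sequence 13 10 to show up in my parsed tree, I have to include the character reference &#xd;, as clarified in a note in the W3C spec (I recognize that these notes come from XML-1.1 rather than XML-1.0, but they clarify confusing things in XML-1.0 without describing different behavior).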

As explained in 2.11 End-of-Line Handling, all #xD characters literally present in an XML document are either removed or replaced by #xA characters before any other processing is done. The only way to get a #xD character to match this production is to use a character reference in an entity value literal.

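With XDocument.Parse, this all seems to work correctly. The text content of the XML above is 13 10 (rather than 13 13 10), which suggests that the character reference is preserved and the literal 13 10 is replaced with 10 before parsing.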

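However, I can't figure out how to get XDocument.ToString() to entitize newlines when serializing. That is, I would expect (XDocument xd) => XDocument.Parse($"{xd}") to be a lossless function. But if I pass in an XDocument instance with 13 10 as its text content, the function returns an XDocument instance with 10 as its text content. See this demonstration: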

using System;
using System.Text;
using System.Xml.Linq;

var x = XDocument.Parse("<x xml:space='preserve'>&#xd;\r\n</x>");
present("content", x.Root.Value); // 13 10, expected
present("formatted", $"{x}"); // inside <x/>: 13 10, unexpected
x = XDocument.Parse($"{x}");
present("round tripped", x.Root.Value); // 10, unexpected

// Note that when formatting the version with just 10 in the value,
// we get Environment.NewLine in the formatted XML. So there is no
// way to differentiate between 10 and 13 10 with XDocument because
// it normalizes when serializing.
present("round tripped formatted", $"{x}"); // inside <x/>: 13 10, expected

void present(string label, string thing)
{
    Console.WriteLine(label);
    Console.WriteLine(thing);
    Console.WriteLine(string.Join(" ", Encoding.UTF8.GetBytes(thing)));
    Console.WriteLine();
}

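You can see that when the XDocument is serialized, it fails to entitize the carriage return as &#xd; (equivalently, &#13;). As a result, it loses information. How can I safely encode an XDocument so that I don't lose anything, especially the carriage returns in the original document I loaded?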

To round trip an XDocument, do not use the recommended/easy serialization methods such as XDocument.ToString(), because they are lossy. Note also that even if you do something like xd.ToString(SaveOptions.DisableFormatting), any carriage returns in the parsed tree will be lost.

Instead, use a properly configured XmlWriter with XDocument.WriteTo. When you go through an XmlWriter, the writer can see that the document contains literal carriage returns and encode them correctly. To instruct it to do so, set XmlWriterSettings.NewLineHandling to NewLineHandling.Entitize. You may want to write an extension method to make this easier to reuse.

The demonstration, modified to use this approach, looks like this:

using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

var x = XDocument.Parse("<x xml:space='preserve'>&#xd;\r\n</x>");
present("content", x.Root.Value); // 13 10, expected
present("formatted", toString(x)); // inside <x/>: 38 35 120 68 59 10 ("&#xD;\n"), acceptable
x = XDocument.Parse(toString(x));
present("round tripped", x.Root.Value); // 13 10, expected

string toString(XDocument xd)
{
    using var sw = new StringWriter();
    using (var writer = XmlWriter.Create(sw, new XmlWriterSettings
    {
        NewLineHandling = NewLineHandling.Entitize,
    }))
    {
        xd.WriteTo(writer);
    }
    return sw.ToString();
}

void present(string label, string thing)
{
    Console.WriteLine(label);
    Console.WriteLine(thing);
    Console.WriteLine(string.Join(" ", Encoding.UTF8.GetBytes(thing)));
    Console.WriteLine();
}
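
The answer suggests wrapping this in an extension method for reuse. A minimal sketch of what that might look like follows. The class name XDocumentRoundTripExtensions and the method name ToRoundTripString are hypothetical names chosen for illustration and are not part of System.Xml.Linq. The body is the same XmlWriter configuration as the toString helper above.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class XDocumentRoundTripExtensions
{
    // Hypothetical helper (not a framework API): serializes the document so that
    // carriage returns in text survive a later XDocument.Parse.
    public static string ToRoundTripString(this XDocument document)
    {
        if (document is null) throw new ArgumentNullException(nameof(document));

        var settings = new XmlWriterSettings
        {
            // Emit carriage returns as character references instead of literal
            // characters, so a normalizing parser does not fold them away.
            NewLineHandling = NewLineHandling.Entitize,
        };

        using var sw = new StringWriter();
        using (var writer = XmlWriter.Create(sw, settings))
        {
            document.WriteTo(writer);
        } // Disposing the writer flushes its output into sw before we read it.

        return sw.ToString();
    }
}
```

With this in scope, the round trip from the question could be written as XDocument.Parse(xd.ToRoundTripString()). Because the writer targets a StringWriter, the output normally begins with an XML declaration; if that is unwanted, setting OmitXmlDeclaration = true on the same settings object should suppress it.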