How do I round trip an entitized carriage return with XDocument?
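Suppose I have this XML document:
<x xml:space='preserve'>&#xd;
</x>
with this byte sequence as the content of <x/>:
38 35 120 100 59 13 10
My understanding of the W3C spec is that the literal sequence 13 10 will be replaced before parsing. To get the sequence 13 10 to show up in my parsed tree, I have to include the character reference &#xd;, as clarified in a note in the W3C spec. (I recognize that this note comes from XML 1.1 rather than XML 1.0, but it clarifies confusing things in XML 1.0 without describing different behavior.)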
As explained in 2.11 End-of-Line Handling, all #xD characters literally present in an XML document are either removed or replaced by #xA characters before any other processing is done. The only way to get a #xD character to match this production is to use a character reference in an entity value literal.
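With XDocument.Parse, this all seems to work correctly. The text content of the XML above is 13 10 (rather than 13 13 10), which suggests that the character reference is preserved and the literal 13 10 is replaced with 10 before parsing.
However, I cannot figure out how to get XDocument.ToString() to entitize newlines when serializing. That is, I would expect (XDocument xd) => XDocument.Parse($"{xd}") to be a lossless function. But if I pass in an XDocument instance with 13 10 as its text content, the function outputs an XDocument instance with 10 as its text content. See this demonstration: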
using System;
using System.Text;
using System.Xml.Linq;

var x = XDocument.Parse("<x xml:space='preserve'>&#xd;\r\n</x>");
present("content", x.Root.Value); // 13 10, expected
present("formatted", $"{x}"); // inside <x/>: 13 10, unexpected
x = XDocument.Parse($"{x}");
present("round tripped", x.Root.Value); // 10, unexpected
// Note that when formatting the version with just 10 in the value,
// we get Environment.NewLine in the formatted XML. So there is no
// way to differentiate between 10 and 13 10 with XDocument because
// it normalizes when serializing.
present("round tripped formatted", $"{x}"); // inside <x/>: 13 10, expected
// Prints the label, the string itself, and its UTF-8 bytes in decimal.
void present(string label, string thing)
{
    Console.WriteLine(label);
    Console.WriteLine(thing);
    Console.WriteLine(string.Join(" ", Encoding.UTF8.GetBytes(thing)));
    Console.WriteLine();
}
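You can see that when the XDocument is serialized, it fails to entitize the carriage return as either &#xD; or &#13;. The result is that it loses information. How can I safely encode an XDocument so that I do not lose anything, especially the carriage returns that were in the original document I loaded?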
To round trip an XDocument, do not use the recommended/easy serialization methods such as XDocument.ToString(), because they are lossy. Also note that even if you do something like xd.ToString(SaveOptions.DisableFormatting), any carriage returns in the parsed tree will be lost.
Instead, use a properly configured XmlWriter with XDocument.WriteTo. An XmlWriter can see that the parsed tree contains literal carriage returns and encode them correctly. To instruct it to do so, set XmlWriterSettings.NewLineHandling to NewLineHandling.Entitize. You may want to write an extension method to make this easier to reuse.
The demo, modified to use this approach, follows:
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

var x = XDocument.Parse("<x xml:space='preserve'>&#xd;\r\n</x>");
present("content", x.Root.Value); // 13 10, expected
present("formatted", toString(x)); // inside <x/>: 38 35 120 68 59 10 ("&#xD;\n"), acceptable
x = XDocument.Parse(toString(x));
present("round tripped", x.Root.Value); // 13 10, expected
// Serializes the document with carriage returns written as &#xD;.
string toString(XDocument xd)
{
    using var sw = new StringWriter();
    using (var writer = XmlWriter.Create(sw, new XmlWriterSettings
    {
        NewLineHandling = NewLineHandling.Entitize,
    }))
    {
        xd.WriteTo(writer);
    }
    return sw.ToString();
}
// Prints the label, the string itself, and its UTF-8 bytes in decimal.
void present(string label, string thing)
{
    Console.WriteLine(label);
    Console.WriteLine(thing);
    Console.WriteLine(string.Join(" ", Encoding.UTF8.GetBytes(thing)));
    Console.WriteLine();
}
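The extension method suggested above could look roughly like the following. This is only a sketch: the class name XDocumentRoundTripExtensions and the method name ToRoundTripString are made up for illustration, and setting OmitXmlDeclaration is an optional assumption added here to mirror how XDocument.ToString() leaves out the XML declaration.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Usage: parse a document whose text content is CR LF, serialize it, parse it again.
var original = XDocument.Parse("<x xml:space='preserve'>&#xD;\r\n</x>");
var roundTripped = XDocument.Parse(original.ToRoundTripString());

// True if the carriage return survived the round trip.
Console.WriteLine(original.Root.Value == roundTripped.Root.Value);

// Hypothetical helper class; the names are illustrative, not part of any library.
public static class XDocumentRoundTripExtensions
{
    // Serializes the document so that carriage returns in text are written
    // as &#xD; character references instead of being normalized away.
    public static string ToRoundTripString(this XDocument document)
    {
        var settings = new XmlWriterSettings
        {
            NewLineHandling = NewLineHandling.Entitize,
            // Assumption: leave out the <?xml ...?> declaration, as ToString() does.
            OmitXmlDeclaration = true,
        };

        using var sw = new StringWriter();
        using (var writer = XmlWriter.Create(sw, settings))
        {
            document.WriteTo(writer);
        }
        // The writer is disposed (and flushed) before the string is read.
        return sw.ToString();
    }
}
```

The inner using block matters: the XmlWriter buffers its output, so it has to be flushed or disposed before reading the StringWriter.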