XML: escaping special characters
Create a file with the following content:
<xml>yen symbol - ¥</xml>
Opening the file in Firefox produces this error:
XML Parsing Error: not well-formed
Location: file:///test.xml
Line Number 1, Column 19:<xml>yen symbol - </xml>
------------------^
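How do I escape special characters in XML?
Note: I am using the .NET XmlDocument.OuterXml property to retrieve the XML. For some reason, .NET does not escape the yen character automatically.
Update: The real problem is that I build the XML in code in .NET and then push it to Solr over HTTP. The Java code in Solr breaks because it considers the yen character to be malformed XML. I set the encoding to UTF-8.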
' Place these Imports at the top of the source file.
Imports System.IO
Imports System.Net

Public Shared Sub UpdateRecords(p_SolrRecordCollection As SolrRecordCollection, Optional commit As Boolean = True, Optional optimize As Boolean = True)
    Try
        Using webClientInstance As New WebClient()
            webClientInstance.Headers.Add("Content-Type", "text/xml")
            ' UploadString converts the string to bytes using this encoding.
            webClientInstance.Encoding = System.Text.Encoding.UTF8
            Dim xml = p_SolrRecordCollection.XmlDocument.OuterXml
            Dim params As String = String.Format("?commit={0}&optimize={1}", commit.ToString().ToLower(), optimize.ToString().ToLower())
            webClientInstance.UploadString(SolrURL + UpdateRelativeURL + params, xml)
        End Using
    Catch ex As WebException
        Dim responseText As String = String.Empty
        If ex.Response IsNot Nothing Then
            responseText = " :" & ControlChars.NewLine
            Using reader = New StreamReader(ex.Response.GetResponseStream())
                ' Append rather than assign, so the " :" prefix is kept.
                responseText &= reader.ReadToEnd()
            End Using
        End If
        Throw New Exception("Request to Solr failed" & responseText, ex)
    End Try
End Sub
This is the error Solr reports:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">500</int><int name="QTime">135</int></lst><lst name="error"><str name="msg">[com.ctc.wstx.exc.WstxLazyException] Illegal character entity: expansion character (code 0xb) not a valid XML character
at [row,col {unknown-source}]: [827,871]</str><str name="trace">[com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0xb) not a valid XML character
at [row,col {unknown-source}]: [827,871]
at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
at org.apache.solr.handler.UpdateRequestHandler.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:365)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
Caused by: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0xb) not a valid XML character
at [row,col {unknown-source}]: [827,871]
at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461)
at com.ctc.wstx.sr.StreamScanner.reportIllegalChar(StreamScanner.java:2400)
at com.ctc.wstx.sr.StreamScanner.checkAndExpandChar(StreamScanner.java:2346)
at com.ctc.wstx.sr.StreamScanner.resolveSimpleEntity(StreamScanner.java:1205)
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4677)
at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
... 36 more
</str><int name="code">500</int></lst>
</response>
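Make sure you save the file in an encoding that handles the yen character correctly and that Firefox recognizes, such as UTF-8. (It seems to me that Firefox expects Unicode when nothing else is specified, but I have not verified this.) Then there is no need to escape the character.
Better still, add an XML declaration that states the encoding used: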
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xml>yen symbol - ¥</xml>
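To illustrate why both the bytes on disk and the declared encoding matter, here is a minimal sketch in Python (chosen only for illustration; the question itself concerns .NET). It feeds three byte sequences for the same one-line document to the standard-library `xml.etree.ElementTree` parser. The helper name `parses` is hypothetical, and ISO-8859-1 stands in for whatever legacy code page the file was actually saved in, which the thread does not state.

```python
import xml.etree.ElementTree as ET

DOC = "<xml>yen symbol - \u00a5</xml>"  # \u00a5 is the yen sign


def parses(raw: bytes) -> bool:
    """Return True if the bytes are accepted as well-formed XML."""
    try:
        ET.fromstring(raw)
        return True
    except ET.ParseError:
        return False


# Saved in a single-byte code page with no declaration: the parser
# assumes UTF-8 and meets the lone byte 0xA5, which is not valid UTF-8.
legacy_bytes = DOC.encode("latin-1")

# Saved as UTF-8: the yen sign becomes the two bytes 0xC2 0xA5.
utf8_bytes = DOC.encode("utf-8")

# Saved in the legacy code page, but with a declaration that says so.
declared_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>' + DOC
).encode("latin-1")

results = {
    "legacy, undeclared": parses(legacy_bytes),
    "utf-8": parses(utf8_bytes),
    "legacy, declared": parses(declared_bytes),
}
print(results)
```

Only the undeclared legacy bytes should be rejected; in the other two cases the declared (or default) encoding agrees with the bytes actually written.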
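The file you are creating is not saved as UTF-8; it is probably ASCII or another legacy single-byte encoding. You can check this by opening it in Notepad or any other text editor that can save files as UTF-8. In Notepad, "Save as..." offers an Encoding drop-down, which defaults to the encoding the file already uses.
You do not need to escape the yen character at all. Once the file is converted to UTF-8, Firefox or any other XML parser should have no problem with it.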
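For reference, the characters that do need escaping in XML text are the markup characters `&` and `<` (and, by convention, `>`), plus quotes inside attribute values. The sketch below uses Python's standard `xml.sax.saxutils.escape` purely as an illustration; the sample string is made up. It also shows that a numeric character reference such as `&#xA5;` is an optional alternative spelling of the yen sign, not a requirement.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

text = 'price < 100 & currency = "\u00a5"'

# escape() rewrites &, < and > and leaves the yen sign untouched.
escaped_text = escape(text)

# For attribute values, the quote character can be escaped as well.
escaped_attr = escape(text, {'"': "&quot;"})

# A numeric character reference decodes to the same character.
decoded = ET.fromstring("<xml>yen symbol - &#xA5;</xml>").text

print(escaped_text)
print(escaped_attr)
print(decoded)
```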
Your error message leads me to believe that the yen character is a red herring.
expansion character (code 0xb) not a valid XML character
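That is the vertical tab character (U+000B, the single byte 0x0B in UTF-8). It sounds like something is being corrupted during an encoding conversion. I am not sure what encoding your SolrRecordCollection object returns, but I would guess UTF-8. If you can, find out what encoding the XmlDocument method returns.
The WebClient.UploadString method performs an encoding conversion: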
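Wherever the U+000B comes from, XML 1.0 does not allow it in a document at all, not even written as the character reference `&#xB;`, and the phrase "Illegal character entity" in the Solr error suggests it arrived in that form. One possible workaround, sketched here in Python for illustration only, is to strip characters outside the XML 1.0 `Char` production from field values before building the document. The names `strip_invalid_xml_chars` and `is_well_formed` and the sample data are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Everything NOT matched by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(value: str) -> str:
    """Remove characters that XML 1.0 cannot represent."""
    return _INVALID_XML_CHARS.sub("", value)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


dirty = "line one\x0bline two \u00a5"  # contains a vertical tab
clean = strip_invalid_xml_chars(dirty)

# The reference form of the vertical tab is rejected by the parser...
rejected = is_well_formed("<field>line one&#xB;line two</field>")
# ...while the cleaned value, yen sign included, is accepted.
accepted = is_well_formed("<field>" + clean + "</field>")

print(repr(clean), rejected, accepted)
```

Stripping silently changes the data, so replacing the character with a space may be preferable, depending on what the field holds.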
Before uploading the string, this method converts it to a Byte array
using the encoding specified in the Encoding property.
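So my guess is that it takes a UTF-8 string, interprets it as a standard .NET UTF-16 string, and then converts that misinterpreted data to UTF-8. I think that converting the XML string variable to UTF-16 before passing it to the method might solve your problem. Here is a question that covers how to do that: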
How do you convert an xml string with UTF-8 encoding UTF-16?
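The general failure mode being described, bytes decoded with the wrong encoding and then encoded again, can be sketched as follows. This is a Python illustration, not the .NET code path; Latin-1 is used as the stand-in for the wrong encoding because it makes the damage easy to see, and it is an assumption rather than something the thread establishes.

```python
yen = "\u00a5"

# Correct: one conversion from text to UTF-8 bytes.
utf8_once = yen.encode("utf-8")            # two bytes: 0xC2 0xA5

# Wrong: the UTF-8 bytes are read back as if they were Latin-1 text...
misread = utf8_once.decode("latin-1")      # two characters: U+00C2 U+00A5

# ...and then encoded to UTF-8 a second time.
utf8_twice = misread.encode("utf-8")       # four bytes

print(utf8_once, misread, utf8_twice)
```

Comparing the bytes actually sent on the wire with the expected `0xC2 0xA5` is a quick way to tell whether this kind of double conversion is happening.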
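FYI, this article is easy to follow and helps in understanding text encodings:
I went this route: I rewrote my upload logic to use JSON, with Newtonsoft's Json library handling all the JSON escaping. I know this is not the proper fix for the problem, but it was an effective way out of all the XML nightmares I had been through.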
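One likely reason the JSON route avoids the error is that JSON can represent control characters through `\uXXXX` escapes, whereas XML 1.0 has no legal spelling for U+000B. The sketch below uses Python's standard `json` module only to illustrate that behaviour; it is not the Newtonsoft code from this answer, and the record fields are made up.

```python
import json

# A hypothetical record containing both a vertical tab and a yen sign.
record = {"id": "1", "text": "line one\x0bline two \u00a5"}

# The serializer escapes the control character instead of rejecting it.
payload = json.dumps(record)

# Parsing the payload restores the original characters exactly.
round_tripped = json.loads(payload)

print(payload)
```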