Parse XML into R

I am trying to parse an XML file in R, but I get this error:

Entity `thinsp` not defined

I found the `&thinsp;` entity in the file, but I don't know how to handle it. I would really appreciate any help. I tried the following:

file1 <- xmlTreeParse("1496019.xml",useInternalNodes = TRUE)
file2 <- xmlParse("1496019.xml",useInternalNodes = TRUE)

Here is a sample of the XML:

<!DOCTYPE om  PUBLIC "" "sm.dtd"><servinfo>
<servinfosub>
<title>Circuit Description</title>
<ptxt>The commanded throttle position (TP) is compared to the actual TP.</ptxt>
</servinfosub>
<servinfosub>
<title>DTC Descriptor</title>
<ptxt>This diagnostic procedure supports the following DTC:</ptxt>
<ptxt>DTC&thinsp;P2101 Throttle Actuator Position Performance</ptxt>
</servinfosub>
<servinfosub>
<title>Diagnostic Aids</title>
<list1 type="unordered-bullet">
<item><ptxt>The throttle valve should be open approximately 20&thinsp;percent. </ptxt></item>
<item><ptxt>If the throttle blade becomes stuck, DTC&thinsp;P1516 and/or P2119 will set. </ptxt></item>
<item>
<important><title>Important</title><ptxt> this function.</ptxt></important>
<ptxt>The scan tool has the ability to operate the throttle control system using Special Functions. </ptxt></item>
<item><ptxt>Inspect for the following conditions:</ptxt></item>
<list2 type="unordered-dash">
<item><ptxt>Use the  <object-link object-id="8917"/> Connector Test Adapter Kit for any test that requires probing the PCM harness connector or a component harness connector.</ptxt></item>
<item><ptxt>Poor connections at the PCM or at the component—Inspect the harness connectors for a poor terminal to wire connection. Refer to  <cell-link cell-id="62112"/> for the proper procedure.</ptxt></item>
<item><ptxt>For intermittents, refer to  <cell-link cell-id="81512"/>.</ptxt></item>
</list2>
</list1>
</servinfosub>
</servinfo>

One way to solve this is to preprocess the document and replace the unknown entity before parsing:

library(XML)

# Minimal document that contains the undefined entity
txt <- '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><entry>abc&thinsp;</entry>'
xml <- xmlParse(txt, asText = TRUE)
# Error: 1: Entity 'thinsp' not defined

# Remove the entity from the raw text, then parse again
txt <- gsub("&thinsp;", "", txt, fixed = TRUE)
(xml <- xmlParse(txt, asText = TRUE))
# <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
# <entry>abc</entry>
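
Note that deleting the entity also removes the spacing it stood for, so `DTC&thinsp;P2101` in the sample would become `DTCP2101`. If that matters, a variation is to substitute the numeric character reference `&#8201;` (U+2009, thin space), which should parse without a DTD. The following is only a sketch: `replace_thinsp` is a hypothetical helper name, not part of the XML package, and the commented-out file handling assumes the whole file fits comfortably in memory.

```r
library(XML)

# Hypothetical helper (not from the XML package): swap the undefined named
# entity for its numeric character reference, which needs no DTD.
replace_thinsp <- function(txt) {
  # &#8201; is U+2009 THIN SPACE
  gsub("&thinsp;", "&#8201;", txt, fixed = TRUE)
}

# One element taken from the sample document in the question
txt <- "<ptxt>DTC&thinsp;P2101 Throttle Actuator Position Performance</ptxt>"
doc <- xmlParse(replace_thinsp(txt), asText = TRUE)

# Text content of the root node, with a thin space between "DTC" and "P2101"
value <- xmlValue(xmlRoot(doc))

# For the file on disk, the same idea might look like this (assumption:
# reading all lines into one string is acceptable for the file size):
# raw <- paste(readLines("1496019.xml", warn = FALSE), collapse = "\n")
# doc <- xmlParse(replace_thinsp(raw), asText = TRUE)
```

Replacing with a plain space (`" "`) instead of `&#8201;` would work the same way if the exact character is not important downstream.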