XSLT 2.0: Create RegEx to enumerate chapter numbers and descriptions from continuous text nodes
I would like to extract chapter numbers, titles, and descriptions from an XML file into an XML element/attribute hierarchy. They are spread across continuous text in different elements. The XML looks like this:
<?xml version="1.0" encoding="utf-8"?>
<root>
  <cell>3.1.1.17 First Section The “First appropriate” section lists things that can occur when an event happens. All of these event conditions result in an error.
  </cell>
  <cell>3.1.1.18 Second Section This section lists things that occur under certain conditions. 3.1.1.19 Third Section This section lists events that occur within a specific space. 3.2 SPACE chapter provides descriptions other stuff. See also: Chapter 4, “Other Stuff Reference” in the Manual.
  </cell>
</root>
The desired output should look like this:
<?xml version="1.0" encoding="utf-8"?>
<Root>
  <Desc chapter="3.1.1.17" title="First Section">The “First appropriate” section lists things that can occur when an event happens. All of these event conditions result in an error.</Desc>
  <Desc chapter="3.1.1.18" title="Second Section">This section lists things that occur under certain conditions.</Desc>
  <Desc chapter="3.1.1.19" title="Third Section">This section lists events that occur within a specific space. 3.2 SPACE chapter provides descriptions other stuff. See also: Chapter 4, “Other Stuff Reference” in the Manual.</Desc>
</Root>
My XSLT so far is:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml" encoding="utf-8" />
  <xsl:template match="text()" />
  <xsl:template match="/root">
    <Root>
      <xsl:apply-templates select="cell" />
    </Root>
  </xsl:template>
  <xsl:template match="cell">
    <xsl:variable name="sections" as="element(Desc)*">
      <xsl:analyze-string regex="(\d+\.\d+\.\d+\.\d+)\s(.*?Section)(.*?)" select="text()">
        <xsl:matching-substring>
          <Desc chapter="{regex-group(1)}" title="{regex-group(2)}">
            <xsl:value-of select="regex-group(3)" />
          </Desc>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <xsl:for-each select="$sections">
      <xsl:copy-of select="." />
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
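The problem lies in the last part of the RegEx: (.*?), a non-greedy consuming expression. Unfortunately, I cannot get it to stop at the right position. I tried ?: and (?=...) to make it stop, without consuming, before the next \d+\.\d+\.\d+\.\d+\. but the RegEx syntax of XSLT 2.0 seems to differ somewhat from other dialects.
How can I extract the relevant parts so that I can conveniently process them in a for-each as regex-group(1..3)?
In addition, I am interested in a fairly complete XSLT 2.0 reference for all RegEx tokens.
It seems that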
<xsl:template match="cell">
  <!-- Temporary tree: one Desc per section heading, one Value per stretch of text between headings -->
  <xsl:variable name="sections">
    <xsl:analyze-string regex="(\d+\.\d+\.\d+\.\d+)\s(.*?Section)" select=".">
      <xsl:matching-substring>
        <!-- The regex has only two groups; the description comes from the non-matching text -->
        <Desc chapter="{regex-group(1)}" title="{regex-group(2)}"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <Value>
          <xsl:value-of select="."/>
        </Value>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:variable>
  <xsl:for-each select="$sections/Desc">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- Take the Value directly after this Desc, with whitespace normalized -->
      <xsl:value-of select="normalize-space(following-sibling::*[1][self::Value])"/>
    </xsl:copy>
  </xsl:for-each>
</xsl:template>
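captures the data you want to select, plus the trailing text. XSLT 2.0 regular expressions have no lookahead, so the description is taken from the non-matching substring that follows each match.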
Sorry, I have to answer in JS, but I am sure you can easily work out what is going on. Your regex and replacement could look like this:
var xmlData = '<?xml version="1.0" encoding="utf-8"?>\n<root>\n <cell>3.1.1.17 First Section The “First appropriate” section lists things that can occur when an event happens. All of these event conditions result in an error.\n </cell>\n <cell>3.1.1.18 Second Section This section lists things that occur under certain conditions. 3.1.1.19 Third Section This section lists events that occur within a specific space. 3.2 SPACE chapter provides descriptions other stuff. See also: Chapter 4, “Other Stuff Reference” in the Manual.\n </cell>\n</root>',
    // Dots are escaped so they only match literal dots in the chapter number
    rex = /<cell>(?:\s*(\d+\.\d+\.\d+\.\d+)\s+(\w+)\s+Section)(.+)\n*\s*<\/cell>/gm,
    // $1 = chapter number, $2 = first word of the title, $3 = remaining text of the cell
    xml = xmlData.replace(rex, '<Desc chapter="$1" title="$2 Section">$3</Desc>');
console.log(xml);
This logs:
<?xml version="1.0" encoding="utf-8"?>
<root>
 <Desc chapter="3.1.1.17" title="First Section"> The “First appropriate” section lists things that can occur when an event happens. All of these event conditions result in an error.</Desc>
 <Desc chapter="3.1.1.18" title="Second Section"> This section lists things that occur under certain conditions. 3.1.1.19 Third Section This section lists events that occur within a specific space. 3.2 SPACE chapter provides descriptions other stuff. See also: Chapter 4, “Other Stuff Reference” in the Manual.</Desc>
</root>
</root>
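The replace above treats each cell as a single section, so 3.1.1.19 stays inside the second Desc. As a rough sketch of how the lookahead idea from the question could split several sections within one cell, the following uses matchAll with a lazy body that stops before the next four-part number or at the end of the text. The function name extractSections is hypothetical, and the sketch assumes every title ends with the word "Section", as the regexes in this thread do. It relies on lookahead, so it does not carry over to XSLT 2.0 regular expressions.

```javascript
// Hypothetical helper (not from the original answers): splits the text of one
// <cell> into { chapter, title, text } records.
function extractSections(cellText) {
  // A section starts with a four-part number and a title ending in "Section".
  // The body is matched lazily up to the next four-part number or the end of
  // the string; the lookahead does not consume that number, so the next match
  // can start on it.
  const rex =
    /(\d+\.\d+\.\d+\.\d+)\s+(.*?Section)\s+([\s\S]*?)(?=\s*\d+\.\d+\.\d+\.\d+\s|\s*$)/g;
  return Array.from(cellText.matchAll(rex), (m) => ({
    chapter: m[1],
    title: m[2],
    text: m[3],
  }));
}

// Shortened version of the second <cell> from the question.
const cell =
  "3.1.1.18 Second Section This section lists things that occur under certain conditions. " +
  "3.1.1.19 Third Section This section lists events that occur within a specific space. " +
  "3.2 SPACE chapter provides descriptions other stuff.\n ";

for (const s of extractSections(cell)) {
  // Note: no XML escaping is done here; real input may need it.
  console.log(`<Desc chapter="${s.chapter}" title="${s.title}">${s.text}</Desc>`);
}
```

For this input the sketch yields two records, 3.1.1.18 and 3.1.1.19. The "3.2 SPACE ..." sentence stays in the text of the latter because 3.2 is not a four-part number.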