XSLT woes, or fun with XPath, text(), and sum(), a tragedy in three acts

Question

I want to do a basic word count on an HTML file, excluding some elements that should not be counted. A sample file might look like this:

<?xml version='1.0'?>
<html xmlns='http://www.w3.org/1999/xhtml' xmlns:epub='http://www.idpf.org/2007/ops'>
    <head>
        <meta charset='utf-8' />
        <link rel='stylesheet' type='text/css' href='standard.css' />
        <title>Book title</title>
    </head>
    <body class='contents'>
        <h3>Chapter 1</h3>
        <p>Lorem ipsum dolor sit amet</p>
        <p>consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore</p>
        <blockquote>et dolore magna</blockquote>
        <p>aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi</p>
    </body>
</html>

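It should ignore everything that is not in the body (usually that just means excluding <title/>), but there are also some tags inside the body that I want to exclude, mainly the leading headers. The number of allowed tags is small enough (I think) that it does not matter to me whether I whitelist the allowed tags or blacklist the disallowed ones.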

I am using a command-line tool for Windows called xsltproc; I do not remember where I got it. It reports these versions:

xsltproc --version
Using libxml 20708, libxslt 10126 and libexslt 815
xsltproc was compiled against libxml 20706, libxslt 10126 and libexslt 815
libxslt 10126 was compiled against libxml 20706
libexslt 815 was compiled against libxml 20706

I do not know whether it supports XSLT 2+; I am only familiar with XSLT 1.0, and I seem to have to relearn it every time I deal with it.

For any given p|blockquote|cite element in my HTML file, I can get a word count with an expression like this (shamelessly borrowed from another SO post):

<xsl:value-of select="string-length(normalize-space(.))
-
 string-length(translate(normalize-space(.),' ','')) +1
 "/>

Substituting a valid (namespaces, ugh!) XPath for the argument of normalize-space() works; it is usually some form of //ns:p/text().

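But that only gets me the count for a single <p>. I could also put it inside an xsl:for-each and get a long list of per-paragraph counts. What I really want is a grand total... and seeing that there is a sum(), it should be easy. But somehow I am messing it up.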

<?xml version='1.0' encoding='utf-8' ?>
<xsl:stylesheet version='1.0' 
                xmlns:h='http://www.w3.org/1999/xhtml'
                xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
    <xsl:strip-space elements="*"/>
    <xsl:output method="xml" indent="yes" encoding="utf-8" standalone="no" cdata-section-elements="style"/>
    <xsl:template match='/'>
        <a>
        <q><xsl:value-of select="//h:body//text()"/></q>
        <b>
            <xsl:value-of select="string-length(normalize-space(//h:p/text())) - string-length(translate(normalize-space(//h:p/text()), ' ', '')) + 1"/>
        </b>
        </a>
    </xsl:template>
</xsl:stylesheet>

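The above at least ignores the h3 tag. But it only looks at the first paragraph and not the others, giving a result of 5 rather than the expected 27. Going back to the dot XPath seems to pick up the text nodes of multiple tags, but gives the odd value 29 (should be 34? or 32 if it somehow only got the children of the body element?).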

Is there a way to get a reasonable value, or are these just the wrong tools?

Answer 1

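The libxslt processor supports only XSLT 1.0. In XSLT 1.0, you cannot directly sum calculated values: sum() takes a node-set, not the result of an expression evaluated per node.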
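
To make the limitation concrete: a string function such as normalize-space() that is handed a node-set uses only the first node in document order, which is why the stylesheet in the question reports 5. The 29 from the dot expression is probably the whole document's text run together after xsl:strip-space removed the whitespace between elements, so six text nodes holding 34 words merge at five boundaries. A minimal sketch, assuming the sample file from the question (the x prefix is an arbitrary choice):

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- A node-set passed to a string function is converted to the
         string value of its FIRST node only. -->
    <xsl:value-of select="normalize-space(//x:p/text())"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Per-paragraph counts: the expression is evaluated once per p,
         with "." bound to that single paragraph. -->
    <xsl:for-each select="//x:p">
      <xsl:value-of select="string-length(normalize-space(.))
                            - string-length(translate(normalize-space(.), ' ', ''))
                            + 1"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Run against the sample, this prints the first paragraph's text followed by 5, 10 and 12 on separate lines. Those three numbers are text output, not nodes, so there is nothing for sum() to operate on.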

OTOH, the processor supports many extension functions. For example, the EXSLT str:tokenize() function makes producing a word count much easier.

Try something like this:

<xsl:stylesheet version="1.0"   
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
xmlns:x="http://www.w3.org/1999/xhtml"
xmlns:exsl="http://exslt.org/common"
xmlns:str="http://exslt.org/strings"
exclude-result-prefixes="x exsl str">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

<xsl:template match="/x:html">
    <xsl:variable name="word-counts">
        <xsl:for-each select="x:body//*[not(starts-with(name(), 'h'))]/text()">
            <n>
                <xsl:value-of select="count(str:tokenize(., ' '))"/>
            </n>
        </xsl:for-each>
    </xsl:variable>
    <output>
        <xsl:value-of select="sum(exsl:node-set($word-counts)/n)"/>
    </output>
</xsl:template>

</xsl:stylesheet>
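
If extension functions are not available, a recursive named template is one common XSLT 1.0 workaround for carrying a running total. The following is only a sketch: the template name sum-words and its parameters are hypothetical, and the selection of p and blockquote is an assumption based on the question (it also assumes the selected elements are not nested inside one another, otherwise words would be counted twice).

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="x">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/x:html">
    <output>
      <xsl:call-template name="sum-words">
        <!-- Assumed whitelist; extend as needed -->
        <xsl:with-param name="nodes" select="x:body//x:p | x:body//x:blockquote"/>
      </xsl:call-template>
    </output>
  </xsl:template>

  <!-- Hypothetical helper: adds the word count of the first node to a
       running total, then recurses on the remaining nodes. -->
  <xsl:template name="sum-words">
    <xsl:param name="nodes"/>
    <xsl:param name="total" select="0"/>
    <xsl:choose>
      <xsl:when test="$nodes">
        <xsl:variable name="text" select="normalize-space($nodes[1])"/>
        <!-- Spaces between words, plus 1 unless the text is empty -->
        <xsl:variable name="words"
            select="string-length($text)
                    - string-length(translate($text, ' ', ''))
                    + number($text != '')"/>
        <xsl:call-template name="sum-words">
          <xsl:with-param name="nodes" select="$nodes[position() &gt; 1]"/>
          <xsl:with-param name="total" select="$total + $words"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$total"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

For the sample file this should give 30 (27 words in the p elements plus 3 in the blockquote). The recursion goes one level deeper per selected element, so a very long document could run into the processor's recursion depth limit.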

Answer 2

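If you are running from the command line on Windows, you are not limited to the ancient xsltproc/libxslt, which implements only XSLT 1.0. For example, you could use Saxon, which implements XSLT 3.0. Then you can do: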

<xsl:template match="/x:html">
    <out>{count(//*[f:is-included(.)]/text()/tokenize(.))}</out>
</xsl:template>

<xsl:function name="f:is-included" as="xs:boolean">
  <xsl:param name="e" as="element()"/>
  <xsl:sequence select="exists($e[self::p or self::q or self::r...])"/>
</xsl:function>
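
The fragment above needs some surrounding declarations before it will run. A possible complete stylesheet, as a sketch under stated assumptions: the f namespace URI is made up for illustration, the included elements (p, blockquote, cite) are taken from the question rather than from this answer, and xpath-default-namespace is used so that unprefixed names match XHTML elements. The expand-text attribute is what enables the {...} text value template.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:functions"
    xpath-default-namespace="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="xs f"
    expand-text="yes">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/html">
    <!-- tokenize(.) with one argument splits on whitespace;
         count() then counts the tokens across all selected text nodes -->
    <out>{count(body//*[f:is-included(.)]/text() ! tokenize(.))}</out>
  </xsl:template>

  <!-- Assumed whitelist of elements whose text should be counted -->
  <xsl:function name="f:is-included" as="xs:boolean">
    <xsl:param name="e" as="element()"/>
    <xsl:sequence select="exists($e[self::p or self::blockquote or self::cite])"/>
  </xsl:function>
</xsl:stylesheet>
```

Note that text() selects only the direct text children of each included element, so words inside nested inline markup such as em would not be counted unless those elements are also whitelisted.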
