Transforming HTML to Text with the Saxon library in .NET (XSL 2.0)

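I am trying to transform HTML markup to text.

I am using the Saxon library because .NET 4.5 does not natively support XSL 2.0. http://saxon.sourceforge.net/#F9.7HE

When I run my XSL script on http://xslttest.appspot.com/, I get no errors and the output is correct.

HTML code: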

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
    <head> 
        <title>Test Title</title>
    </head>
    <body>
    <h1>Test Header</h1>
    <p>Blah Blah Blah</p>
        <p class="center"><img src="ignore.jpeg" alt="ignore"/></p>
    <div class="Test"><p>More Text</p></div>
    </body>
</html>

XSLT:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <xsl:output method="text" media-type="text"/>

    <xsl:template match="/xhtml:html">
        <xsl:call-template name="print-it">
            <xsl:with-param name="nodeToPrint" select="xhtml:body"/>
        </xsl:call-template>
    </xsl:template>

    <xsl:template name="print-it">
        <xsl:param name="nodeToPrint"/>
        <xsl:for-each select="child::*">
            <xsl:choose>
                <xsl:when test="matches(lower-case(local-name(.)), 'h[123456]|p|div|title')">
                    <xsl:value-of select="concat(normalize-space(replace(string-join(text(), ''), '''', '')), ' ')"/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="normalize-space(replace(string-join(text(), ''), '''', ''))"/>
                </xsl:otherwise>
            </xsl:choose>
            <xsl:call-template name="print-it">
                <xsl:with-param name="nodeToPrint" select="."/>
            </xsl:call-template>
        </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>

Output:

    Test Title


Test Header
Blah Blah Blah

More Text

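However, when I try to run the transformation in .NET, I get an exception. I am not sure whether the problem is in the XSL script and the online converter is simply forgiving, or whether something is wrong with the Saxon library.

Exception message: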

Exception thrown: 'System.InvalidOperationException' in saxon9he.dll

Additional information: The specified node cannot be inserted as the valid child of this node, because the specified node is the wrong type.

.NET code:

using System;
using System.IO;
using Saxon.Api;

var xslt = new FileInfo(@"C:\path\to\stylesheet.xslt");
var input = new FileInfo(@"C:\path\to\data.xml");
var output = new FileInfo(@"C:\path\to\result.xml");

// Compile stylesheet
var processor = new Processor();
var compiler = processor.NewXsltCompiler();
var executable = compiler.Compile(new Uri(xslt.FullName));

// Do transformation to a destination
var destination = new DomDestination();
using(var inputStream = input.OpenRead())
{
    var transformer = executable.Load();
    transformer.SetInputStream(inputStream, new Uri(input.DirectoryName));
    transformer.Run(destination);
}

// Save result to a file (or whatever else you wanna do)
destination.XmlDocument.Save(output.FullName);

Update

Thanks, MartinHonnen. Your suggestion worked.

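// Serialize the result as text into an in-memory stream instead of a DOM.
Serializer _serializer = new Serializer();
MemoryStream _ms = new MemoryStream();
// The writer is a StreamWriter (a TextWriter), not a String.
StreamWriter _outputStream = new StreamWriter(_ms, new UTF8Encoding(false));
_serializer.SetOutputWriter(_outputStream);

using (var inputStream = input.OpenRead()) {
    XsltTransformer transformer = executable.Load();
    // SaxtonMessageListener is my own message listener class.
    transformer.MessageListener = new SaxtonMessageListener();
    transformer.SetInputStream(inputStream, new Uri(input.DirectoryName));
    transformer.Run(_serializer);
}

// Decode the serialized bytes back into a string.
String _text = Encoding.UTF8.GetString(_ms.ToArray());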

If you just want a string or a text file, you can use a http://saxonica.com/html/documentation/dotnetdoc/Saxon/Api/Serializer.html and either set http://saxonica.com/html/documentation/dotnetdoc/Saxon/Api/Serializer.html#SetOutputFile(string) if you want a file, or create a StringWriter and pass it to http://saxonica.com/html/documentation/dotnetdoc/Saxon/Api/Serializer.html#SetOutputWriter(System.IO.TextWriter). After the Run call you then only need to call ToString() on the StringWriter you created, I think.
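
The StringWriter variant could look roughly like the sketch below. It assumes the Saxon 9.7 HE .NET API used in the question (where `new Serializer()` is available; later releases may prefer creating the serializer from the processor) and reuses the question's placeholder paths. The class name `HtmlToText` is only illustrative. The likely reason `DomDestination` fails is that a stylesheet producing plain text yields top-level text nodes, which a .NET `XmlDocument` does not accept as children of the document node; a `Serializer` avoids building a DOM at all.

```csharp
using System;
using System.IO;
using Saxon.Api;

class HtmlToText
{
    static void Main()
    {
        // Placeholder paths, as in the question.
        var xslt = new FileInfo(@"C:\path\to\stylesheet.xslt");
        var input = new FileInfo(@"C:\path\to\data.xml");

        // Compile the stylesheet.
        var processor = new Processor();
        var executable = processor.NewXsltCompiler().Compile(new Uri(xslt.FullName));

        // Send the result to a StringWriter instead of a DomDestination.
        var serializer = new Serializer();
        var writer = new StringWriter();
        serializer.SetOutputWriter(writer);

        using (var inputStream = input.OpenRead())
        {
            var transformer = executable.Load();
            // Base URI taken from the question's code.
            transformer.SetInputStream(inputStream, new Uri(input.DirectoryName));
            transformer.Run(serializer);
        }

        // The transformed text is now held by the StringWriter.
        string text = writer.ToString();
        Console.WriteLine(text);
    }
}
```

If a file is wanted instead, replacing the `SetOutputWriter` call with `serializer.SetOutputFile(...)` and a target path should be enough, and the `StringWriter` can be dropped.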