Convert Word document into HTML without losing original

Question

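I am currently developing a program that needs to display a Word document as HTML while keeping track of how positions in the HTML map back to the original file.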

To do this, when the Word document is first loaded, an ID is generated for each element in the document.

foreach (Table t in document.Tables)
{
    t.ID = GUID();

    Range range = t.Range;
    foreach (Cell c in range.Cells)
    {
        c.ID = t.ID + TableIDSeparator + GUID();
    }
}

foreach (Paragraph p in document.Paragraphs)
{
    p.ID = GUID();
}

I can then save the document as HTML like this:

document.SaveAs2(tempFileName, WdSaveFormat.wdFormatFilteredHTML);

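However, the document object then becomes the HTML document rather than the original Word document (just as when using "Save As" from the Word menu, where the current window shows the newly saved file instead of the original).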

So I tried saving the document to HTML this way instead:

Document temp = new Document();
string x = document.Range().XML;
temp.Range().InsertXML(x);
temp.SaveAs2(fn, WdSaveFormat.wdFormatFilteredHTML);
temp.Close(false);

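But now the new temp document has lost all the IDs I created in the original document, so I cannot map positions in the HTML file back to the original document.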

Am I missing something important, or is there a way to save the Word document without losing the reference to the original file?

Answer 1

Since the two documents are identical, I copy the IDs over to the new document using the following approach.

Note that the Paragraphs/Tables/etc. collections start at index 1, not 0.

        // Build a unique temp file name for the HTML output
        string fn = Path.Combine(Path.GetTempPath(), TmpPrefix + GUID() + ".html");

        Document temp = new Document();

        // Copy the whole original document into the new document
        temp.Range().InsertXML(doc.Range().XML);

        // Copy IDs, assuming the documents are identical and have the same number of elements.
        // Word collections are 1-based.
        for (int i = 1; i <= temp.Tables.Count; i++)
        {
            temp.Tables[i].ID = doc.Tables[i].ID;

            Range sRange = doc.Tables[i].Range;
            Range tRange = temp.Tables[i].Range;
            for (int j = 1; j <= tRange.Cells.Count; j++)
            {
                tRange.Cells[j].ID = sRange.Cells[j].ID;
            }
        }

        for (int i = 1; i <= temp.Paragraphs.Count; i++)
        {
            temp.Paragraphs[i].ID = doc.Paragraphs[i].ID;
        }

        // Save the temp document as HTML, then close it without saving again
        temp.SaveAs2(fn, WdSaveFormat.wdFormatFilteredHTML);
        temp.Close(false);

        return fn;

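Because I don't need the IDs in the output DOCX file (I only use them to track the relationship between the DOCX file loaded in memory and the HTML representation shown in my application), this works well for my case.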
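The copy loop relies on both documents having the same number of elements in the same order. As a minimal sketch of that positional pairing with an explicit count check, the following uses plain lists in place of the Word collections. The `IdCopy` class and `PairByPosition` method are hypothetical names for illustration and are not part of the interop API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of Microsoft.Office.Interop.Word.
public static class IdCopy
{
    // Maps each 1-based target position (as used by Word collections)
    // to the ID at the same position in the source list.
    // Throws when the counts differ, because a positional copy would
    // then attach IDs to the wrong elements.
    public static Dictionary<int, string> PairByPosition(IList<string> sourceIds, int targetCount)
    {
        if (sourceIds == null) throw new ArgumentNullException(nameof(sourceIds));
        if (sourceIds.Count != targetCount)
        {
            throw new InvalidOperationException(
                $"Element count mismatch: source has {sourceIds.Count}, target has {targetCount}.");
        }

        var map = new Dictionary<int, string>();
        for (int i = 1; i <= targetCount; i++)
        {
            map[i] = sourceIds[i - 1]; // the list is 0-based, the Word index is 1-based
        }
        return map;
    }
}
```

In the real code the same check could be done by comparing `doc.Paragraphs.Count` with `temp.Paragraphs.Count` (and likewise for tables and cells) before copying any IDs.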

Answer 2

However, the approach above is very slow on large documents, so I had to do it a different way:

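    public static string RenderHTMLFile(Document doc)
    {
        string fn = Path.Combine(Path.GetTempPath(), TmpPrefix + GUID() + ".html");

        // Accessing VBProject requires "Trust access to the VBA project object model"
        // to be enabled in Word's macro settings.
        var vba = doc.VBProject;
        var module = vba.VBComponents.Add(Microsoft.Vbe.Interop.vbext_ComponentType.vbext_ct_StdModule);

        // Inject the macro source stored in the project resources
        var code = Properties.Resources.HTMLRenderer;
        module.CodeModule.AddFromString(code);

        // Run the macro, which copies the document and saves the copy as filtered HTML at fn
        Word.Run("renderHTMLCopy", fn);

        return fn;
    }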

Here Properties.Resources.HTMLRenderer is a txt file containing the following VB code:

Sub renderHTMLCopy(ByVal path As String)
'
' renderHTMLCopy Macro
'
'
Selection.WholeStory
Selection.Copy
Documents.Add
Selection.PasteAndFormat wdPasteDefault
ActiveDocument.SaveAs2 path, WdSaveFormat.wdFormatFilteredHTML
ActiveDocument.Close False

End Sub

The previous version took about 1500 ms for a small document, while this version renders the same document in about 400 ms!
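For anyone who wants to reproduce this kind of comparison, a minimal timing sketch using `System.Diagnostics.Stopwatch` might look like the following. The `RenderTiming` class and `MeasureMs` method are hypothetical names, and this is not necessarily how the figures above were measured.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for illustration only.
public static class RenderTiming
{
    // Runs the action once and returns the elapsed wall-clock time in milliseconds.
    public static long MeasureMs(Action action)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));

        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }
}
```

A call such as `long ms = RenderTiming.MeasureMs(() => RenderHTMLFile(doc));` would time one render. A single run is noisy, so averaging several runs on the same document gives a fairer comparison.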

c#

ms-word

office-interop