I want to extract the text from a Google Doc using an extension in the browser and preserve semantic line breaks

Question

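I have a browser extension (Firefox and Chrome) that works a lot like a spell checker. It mostly works fine when getting the text value from input and textarea elements, and even from most contenteditable elements. But Google Docs likes to insert \n for visual reasons, which makes it challenging to recover semantic paragraphs and sentences.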

For example, the body text:

A Long Heading That Visually Wraps With No Period On The End
 
A sentence that runs long enough that it visually wraps in Google Docs and ends up with extra line breaks. Another shorter sentence.

When extracted from the Google Docs DOM and run through JSON.stringify, it looks like this:

"\"A Long Heading That Visually Wraps \nWith No Period On The End \n  \nA sentence that runs long enough that it visually wraps in Google Docs and ends up with extra \nline breaks. Another shorter sentence.\""

Notice that the \n before With is not semantic, then the \n \n after the heading is semantic, and then the \n before line breaks is again not semantic.

In this particular case, I can do text.replace(/\n \n/g, '!!!').replace(/\n/g, '').replace(/!!!/g, '\n\n') to get (more) semantic body text.

However, that doesn't work if there is no double \n after the heading.

You can see how fragile that is.
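The placeholder trick can also be written as a single split and join. This is only a sketch, assuming that a whitespace-only line marks a real paragraph break and that every other \n is visual wrapping. collapseVisualBreaks is a hypothetical helper name, and it has the same weakness as the replace chain when the blank line after a heading is missing.

```javascript
// Hypothetical helper, not part of any Google Docs API.
// Assumption: a line containing only whitespace separates real paragraphs;
// any other newline is treated as visual wrapping and collapsed to a space.
function collapseVisualBreaks(text) {
  const blank = /\n[ \t\u00a0]*\n/;            // whitespace-only line
  const wrap = /[ \t\u00a0]*\n[ \t\u00a0]*/g;  // a visual line break plus padding
  return text
    .split(blank)                              // one chunk per paragraph
    .map(p => p.replace(wrap, ' ').trim())     // rejoin wrapped lines
    .filter(p => p.length > 0)                 // drop empty chunks
    .join('\n\n');
}
```

Run on the extracted string from the example, this yields the heading and the two-sentence paragraph separated by one blank line.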

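Is there a JavaScript DOM/API for Google Docs that doesn't require additional authorization, so that I can get the clean text of the document? The user has already installed the extension, and requiring them to also authorize an app for their Google Drive is not feasible.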

Or is there a JavaScript sentence tokenizer? Otherwise, I'll have to send the raw text to a Python API endpoint that uses the NLTK/spaCy sentence tokenizers.
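One built-in candidate for the sentence-tokenizer route is Intl.Segmenter with granularity 'sentence'. The sketch below assumes the target browsers ship Intl.Segmenter, which needs checking for each browser version. splitSentences is a hypothetical helper name, and it expects text whose visual line breaks have already been collapsed.

```javascript
// Hypothetical helper built on the standard Intl.Segmenter API.
// Assumption: the runtime provides Intl.Segmenter; otherwise we fail loudly.
function splitSentences(text, locale = 'en') {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    throw new Error('Intl.Segmenter is not available in this environment');
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' });
  // Each segment keeps its trailing whitespace, so trim it off.
  return Array.from(segmenter.segment(text), s => s.segment.trim())
    .filter(s => s.length > 0);
}
```

This avoids a round trip to a server, but the segmentation rules are locale-based and may differ from what NLTK or spaCy would produce.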

Answer 1

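Depending on whether the document you want to extract data from is public, your application may or may not need authorization to extract clean data.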

Either way, using the Document App of Apps Script or even the Documents API is a good option for getting clean body data. Both offer more than plain text extraction, such as selecting headings, subheadings, and so on.

NOTE: If you try to access a Document that is not public, you will need to use OAuth 2.0. Since it isn't a public resource, you are required to use the credentials of an account that has access to it.
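As a rough sketch of the Documents API route: a documents.get response carries the body as a list of structural elements, and the text of each paragraph can be joined from its textRun elements, so visual wrapping never appears. paragraphsFromDocument and fetchDocument are hypothetical helper names, and accessToken is assumed to be an OAuth 2.0 token obtained elsewhere.

```javascript
// Hypothetical helper: turn a documents.get response into one entry per
// paragraph, keeping the named style (e.g. HEADING_1, NORMAL_TEXT).
function paragraphsFromDocument(doc) {
  const content = (doc.body && doc.body.content) || [];
  return content
    .filter(el => el.paragraph)                // skip tables, section breaks, etc.
    .map(el => {
      const text = (el.paragraph.elements || [])
        .map(e => (e.textRun ? e.textRun.content : ''))
        .join('')
        .replace(/\n$/, '');                   // drop the paragraph's final newline
      const style = el.paragraph.paragraphStyle || {};
      return { style: style.namedStyleType || 'NORMAL_TEXT', text };
    })
    .filter(p => p.text.trim().length > 0);    // drop empty paragraphs
}

// Hypothetical wrapper around the REST endpoint.
// Assumption: accessToken is a valid OAuth 2.0 token with read access.
async function fetchDocument(documentId, accessToken) {
  const url = 'https://docs.googleapis.com/v1/documents/' +
    encodeURIComponent(documentId);
  const res = await fetch(url, {
    headers: { Authorization: 'Bearer ' + accessToken },
  });
  if (!res.ok) throw new Error('Docs API request failed: ' + res.status);
  return res.json();
}
```

With a token in hand, paragraphsFromDocument(await fetchDocument(id, token)) would give headings and body paragraphs as separate entries, at the cost of the authorization step described in the note.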

I hope this helps. Let me know if you need anything else or if anything is unclear. :)

javascript

google-docs