Extract a specific portion of text from a PDF using JavaScript?

Question

I need to make a modification. I'm using this code I found to extract all the text from a PDF:

<!-- edit this; the PDF file must be on the same domain as this page -->
<iframe id="input" src="your-file.pdf"></iframe>

<!-- embed the pdftotext service as an iframe -->
<iframe id="processor" src="http://hubgit.github.com/2011/11/pdftotext/"></iframe>

<!-- a container for the output -->
<div id="output"></div>

<script>
var input = document.getElementById("input");
var processor = document.getElementById("processor");
var output = document.getElementById("output");

// listen for messages from the processor
window.addEventListener("message", function(event){
  if (event.source != processor.contentWindow) return;

  switch (event.data){
    // "ready" = the processor is ready, so fetch the PDF file
    case "ready":
      var xhr = new XMLHttpRequest;
      xhr.open('GET', input.getAttribute("src"), true);
      xhr.responseType = "arraybuffer";
      xhr.onload = function(event) {
        processor.contentWindow.postMessage(this.response, "*");
      };
      xhr.send();
    break;

    // anything else = the processor has returned the text of the PDF
    default:
      output.textContent = event.data.replace(/\s+/g, " ");
    break;
  }
}, true);
</script>

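The output is condensed text without any paragraph breaks. All my PDFs have the word 'Datacover' somewhere near the beginning, followed by a large block of paragraphs.

What I want is to remove everything from the beginning up to the first instance of the word 'Datacover', show the text that follows 'Datacover' up to the third instance of '. ' (a dot followed by a space), and remove all the remaining text through to the end.

Can you help? Thanks!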

Answer 1

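You can match Datacover between word boundaries \b, then repeat 3 times a non-greedy match of any character, including newlines, [\s\S]*? up to the next occurrence of a dot and a space \. (note the trailing space in the pattern).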

\bDatacover\b(?:[\s\S]*?\. ){3}


To get the data, you can use

event.data.match(regex)
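
Keep in mind that match returns null when nothing matches, so the result should be checked before use. A minimal sketch of a null-safe wrapper, assuming only the first match is wanted; the name extractSection is hypothetical and not part of the original code:

```javascript
// Hypothetical helper: returns the text from "Datacover" through the
// third ". ", or an empty string when the pattern is not found.
function extractSection(text) {
  // Without the g flag, match() returns the first match or null.
  const regex = /\bDatacover\b(?:[\s\S]*?\. ){3}/;
  const match = text.match(regex);
  return match ? match[0] : "";
}

console.log(extractSection("intro Datacover one. two. three. four. "));
// logs "Datacover one. two. three. "
console.log(extractSection("no keyword here. "));
// logs an empty string
```

Returning an empty string on a miss is just one option; returning the original text or throwing may suit your page better.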

For example:

const regex = /\bDatacover\b(?:[\s\S]*?\. ){3}/g;
let event = {
  data: `testhjgjhg hjg jhg jkgh kjhghjkg76t 76 tguygtf yr 6 rt6 gtyut 67 tuy yoty yutyu tyu yutyuit iyut iuytiyu tuiyt Datacover uytuy tuyt uyt uiytuiyt uytutest.
yu tuyt uyt uyt iutiuyt uiy
 yuitui tuyt
test. 
 uiyt uiytuiyt
 uyt ut ui
this is a test. 
sjhdgfjsa. 
hgwryuehrgfhrghw fsdfdfsfs sddsfdfs.`
};

console.log(event.data.match(regex));
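
The page in the question collapses whitespace with replace(/\s+/g, " ") before displaying the text. Below is a sketch of how the match might be combined with that step. It assumes the extraction should run on the collapsed text, so that a dot followed by a line break also counts as "dot space". The names SECTION_REGEX and toDisplayText are hypothetical, and falling back to the full text on a miss is an assumption, not something the question asks for:

```javascript
const SECTION_REGEX = /\bDatacover\b(?:[\s\S]*?\. ){3}/;

// Hypothetical helper: collapse whitespace as the question's code does,
// then keep only the part from "Datacover" through the third ". ".
function toDisplayText(rawText) {
  const collapsed = rawText.replace(/\s+/g, " ");
  const match = collapsed.match(SECTION_REGEX);
  // Assumption: fall back to the full collapsed text when the pattern is absent.
  return match ? match[0].trim() : collapsed;
}

// In the message handler, the default branch might then become:
//   output.textContent = toDisplayText(event.data);

console.log(toDisplayText("Title page\nDatacover first.\nsecond.\nthird.\nfourth.\n"));
// logs "Datacover first. second. third."
```

Collapsing first changes which dots count: in the raw text a dot directly followed by a newline is skipped by the pattern, whereas after collapsing it is followed by a space and is counted.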
