How to preserve original line numbering in the output of Stanford CoreNLP?

Question

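Text corpora are often distributed as large files with one document per line. For example, I have a file with 10 million product reviews, one per line, and each review contains multiple sentences.

When processing such a file with Stanford CoreNLP, for example with the command line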

java -cp "*" -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma -file test.txt

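the output, whether in text or xml format, numbers all sentences from 1 to n, ignoring the original line numbering that separates the documents.

I would like to keep track of the original file's line numbering (e.g., in xml format, to have an output tree like <original_line id=1>, then <sentence id=1>, and so on). Alternatively, I would like to be able to reset the sentence numbering at the start of each new line of the original file.

I have tried the answer to a similar question about the Stanford POS tagger, without success. Those options do not keep track of the original line numbers.

A quick solution might be to split the original file into multiple files and then process each one with CoreNLP using the -filelist input option. However, for a large file containing millions of documents, creating millions of individual files just to preserve the original line/document numbering seems inefficient.

I suppose one could modify the source code of Stanford CoreNLP, but I am not familiar with Java.

Any solution that preserves the original line numbering in the output would be very helpful, whether through the command line or through example Java code.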

Answer 1

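I dug through the codebase and could not find a command-line flag that would help you.

I wrote some sample Java code that should do the trick.

I put it in DocPerLineProcessor.java, which I placed in stanford-corenlp-full-2015-04-20. I also put a file there called sample-doc-per-line.txt, which has 4 sentences per line.

First, make sure it compiles: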

cd stanford-corenlp-full-2015-04-20

javac -cp "*:." DocPerLineProcessor.java

Here is the command to run it:

java -cp "*:." DocPerLineProcessor sample-doc-per-line.txt

The output, sample-doc-per-line.txt.xml, should be in the desired xml format, but each sentence now carries the number of the line it came from.

Here is the code:

import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*; 
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.trees.TreeCoreAnnotations.*;
import edu.stanford.nlp.util.*;

public class DocPerLineProcessor {
    public static void main (String[] args) throws IOException {
        // set up properties
        Properties props = new Properties();
        props.setProperty("annotators",
            "tokenize, ssplit, pos, lemma, ner, parse");
        // set up pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // read in a product review per line
        Iterable<String> lines = IOUtils.readLines(args[0]);
        Annotation mainAnnotation = new Annotation("");
        // add a blank list to put sentences into
        List<CoreMap> blankSentencesList = new ArrayList<CoreMap>();
        mainAnnotation.set(CoreAnnotations.SentencesAnnotation.class,blankSentencesList);
        // process each product review
        int lineNumber = 1;
        for (String line : lines) {
            Annotation annotation = new Annotation(line);
            pipeline.annotate(annotation);
            for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
                sentence.set(CoreAnnotations.LineNumberAnnotation.class,lineNumber);
                mainAnnotation.get(CoreAnnotations.SentencesAnnotation.class).add(sentence);
            }
            lineNumber += 1;
        }
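        // write the combined annotation as XML; try-with-resources closes
        // (and flushes) the writer so the output file is complete
        try (PrintWriter xmlOut = new PrintWriter(args[0] + ".xml")) {
            pipeline.xmlPrint(mainAnnotation, xmlOut);
        }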
    }
}

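Now when I run this, the sentence tags also have the appropriate line number. So the sentences still have a global id, but you can tell which line each one came from. Because each line is annotated separately, a newline always ends a sentence.

Please let me know if you need any clarification or if I made any errors transcribing the code.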
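The question also asked about restarting the sentence numbering at each new line. That can be derived from the per-sentence line numbers afterwards. Below is a minimal sketch in plain Java with no CoreNLP dependency; the class and method names are made up for illustration. It assumes you have collected the line number of each sentence in output order (for instance by reading back the same LineNumberAnnotation that the code above sets) and that sentences from the same line are adjacent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
final class PerLineSentenceNumbering {

    /**
     * Given the source line number of each sentence, in output order,
     * returns the 1-based position of each sentence within its own line.
     * Assumes sentences from the same line are adjacent in the input list.
     */
    static List<Integer> indexWithinLine(List<Integer> lineNumbers) {
        List<Integer> result = new ArrayList<>();
        Integer previousLine = null; // no line seen yet
        int counter = 0;
        for (int line : lineNumbers) {
            if (previousLine == null || line != previousLine) {
                counter = 0;         // a new line starts: restart the numbering
                previousLine = line;
            }
            counter++;
            result.add(counter);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input: line 1 yielded three sentences, line 2 none, line 3 two.
        List<Integer> lineNumbers = Arrays.asList(1, 1, 1, 3, 3);
        List<Integer> local = indexWithinLine(lineNumbers);
        for (int i = 0; i < lineNumbers.size(); i++) {
            System.out.println("global id " + (i + 1)
                + " -> line " + lineNumbers.get(i)
                + ", sentence " + local.get(i));
        }
    }
}
```

The global id stays available as the list position, so both numbering schemes can be reported side by side.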

Answer 2

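The question has already been answered, but I had the same problem and came up with a command-line solution that worked for me. The trick is to specify the tokenizerFactory and give it the option tokenizeNLs=true.

It looks like this: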

java -mx1g -cp stanford-corenlp-3.6.0.jar:slf4j-api.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier english.conll.4class.distsim.normal.tagger -outputFormat slashTags -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerOptions "tokenizeNLs=true" -textFile untagged_lines.txt > tagged_lines.txt
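Since the point is to keep output lines aligned with input lines, it may be worth checking that both files end up with the same number of lines before relying on the line numbers. Here is a rough sketch of such a check in plain Java. The class name is made up, the default file names are the ones from the command above, and it assumes both files are UTF-8 text. A matching count does not prove the lines correspond one to one, but a mismatch is a clear sign that they do not.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical sanity check, not part of CoreNLP.
final class LineCountCheck {

    /** Counts the lines of a UTF-8 text file without loading it into memory. */
    static long countLines(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.count();
        }
    }

    /** Returns true if both files have the same number of lines. */
    static boolean sameLineCount(Path input, Path output) throws IOException {
        return countLines(input) == countLines(output);
    }

    public static void main(String[] args) throws IOException {
        // Defaults are the file names used in the command above.
        Path input = Paths.get(args.length > 0 ? args[0] : "untagged_lines.txt");
        Path output = Paths.get(args.length > 1 ? args[1] : "tagged_lines.txt");
        if (sameLineCount(input, output)) {
            System.out.println("Line counts match: " + countLines(input));
        } else {
            System.out.println("Line counts differ: " + countLines(input)
                + " input vs " + countLines(output) + " output");
        }
    }
}
```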

nlp

stanford-nlp