What preprocessing steps should be taken before passing text to the Stanford NER tagger?

Question

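Initially I followed the usual preprocessing steps, such as removing stop words, stripping HTML, and removing punctuation. However, NER seems to perform better when I skip them. Can anyone tell me which preprocessing steps I should follow?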

Answer 1

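The only thing StanfordNER needs is clean text, where clean means free of HTML or any other kind of document markup. You also should not remove stop words, since they can help the model decide which label to assign to a word.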
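If your input does contain HTML, the cleaning step could look roughly like the sketch below. It uses only Python's standard-library `html.parser`; the helper name `clean_for_ner` and the list of block-level tags are my own choices for illustration, not part of StanfordNER. It strips markup and decodes entities but deliberately leaves stop words and punctuation alone.

```python
import re
from html.parser import HTMLParser

# Tags after which a space is inserted so adjacent blocks do not run together.
# This list is an assumption for the sketch, not an exhaustive set.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
# Elements whose content is code or styling, not text to be tagged.
SKIPPED_TAGS = {"script", "style"}


class _TextExtractor(HTMLParser):
    """Collects visible text and drops tags, scripts and styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._parts.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Collapse runs of whitespace; words and punctuation stay untouched.
        return re.sub(r"\s+", " ", "".join(self._parts)).strip()


def clean_for_ner(raw_html):
    """Return plain text suitable for the tagger (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()
```

For example, `clean_for_ner("<p>Soros accuses <b>Trump</b> of wanting a 'mafia state'.</p>")` returns the sentence with the tags gone and every word, quote and full stop still in place.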

All you need is a file containing plain text:

echo "Switzerland, Davos 2018: Soros accuses Trump of wanting a 'mafia state' and blasts social media." > test_file.txt

Then you call stanford-ner.jar and pass it a trained model, e.g. classifiers/english.all.3class.distsim.crf.ser.gz, and an input file, e.g. test_file.txt.

Like this:

java -cp stanford-ner-2017-06-09/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile test_file.txt -outputFormat tsv

This should output something like:

Switzerland LOCATION
,   O
Davos   PERSON
2018    O
:   O
Soros   PERSON
accuses O
Trump   PERSON
of  O
wanting O
a   O
`   O
mafia   O
state   O
'   O
and O
blasts  O
social  O
media   O
.   O

As you can see, you don't even need to handle tokenization (i.e. splitting the sentence into individual tokens/words); StanfordNER does it for you.
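Since the question is tagged python, here is a rough sketch of driving the tagger from Python and turning token-per-line output like the listing above into entities. It assumes the tagger is run with `-outputFormat tsv` so that each line is a token and its label; the function names are hypothetical, and the jar and classifier paths are whatever you have locally. Note that consecutive tokens with the same label are merged, so two different people named back to back would come out as one entity.

```python
import subprocess
from itertools import groupby


def run_stanford_ner(jar_path, classifier_path, text_file):
    """Run CRFClassifier on a text file and return its stdout.

    Requires Java and the Stanford NER distribution; all three paths
    are supplied by the caller. Assumes the `tsv` output format.
    """
    cmd = [
        "java", "-cp", jar_path,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-loadClassifier", classifier_path,
        "-textFile", text_file,
        "-outputFormat", "tsv",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def parse_token_lines(output):
    """Turn 'token<whitespace>label' lines into (token, label) pairs."""
    pairs = []
    for line in output.splitlines():
        parts = line.rsplit(None, 1)  # split on the last run of whitespace
        if len(parts) == 2:           # skip blank or malformed lines
            pairs.append((parts[0], parts[1]))
    return pairs


def group_entities(pairs, outside="O"):
    """Merge consecutive tokens sharing a non-O label into one entity."""
    entities = []
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        if label != outside:
            entities.append((" ".join(token for token, _ in run), label))
    return entities
```

Feeding the first few lines of the listing above through `group_entities(parse_token_lines(...))` gives `[('Switzerland', 'LOCATION'), ('Davos', 'PERSON')]`, which also makes the model's mistake on Davos easy to spot.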

Another useful feature is running StanfordNER as a network service:

java -mx2g -cp stanford-ner-2017-06-09/stanford-ner.jar edu.stanford.nlp.ie.NERServer -loadClassifier my_model.ser.gz -port 9191 -outputFormat inlineXML

Then you can simply telnet to it, or POST a sentence, and get it back tagged:

telnet 127.0.0.1 9191
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Switzerland, Davos 2018: Soros accuses Trump of wanting a 'mafia state' and blasts social media.

<LOCATION>Switzerland</LOCATION>, <PERSON>Davos</PERSON> 2018: <PERSON>Soros</PERSON> accuses <PERSON>Trump</PERSON> of wanting a 'mafia state' and blasts social media.

Connection closed by foreign host.
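The same exchange can be done from Python instead of telnet. The sketch below assumes the server behaves as in the session above: it reads one line, writes back the tagged text, and closes the connection. The function names are mine, and the host and port simply mirror the example; adjust them to your setup.

```python
import re
import socket

# Matches inline tags such as <PERSON>Soros</PERSON>.
INLINE_TAG = re.compile(r"<([A-Z_]+)>(.*?)</\1>", re.DOTALL)


def query_ner_server(text, host="127.0.0.1", port=9191, encoding="utf-8"):
    """Send one line of text to a running NERServer and return its reply."""
    # The server is assumed to read a single line, so fold newlines away.
    line = " ".join(text.split()) + "\n"
    chunks = []
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(line.encode(encoding))
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode(encoding).strip()


def parse_inline_xml(tagged):
    """Extract (entity text, label) pairs from inlineXML output."""
    return [(m.group(2), m.group(1)) for m in INLINE_TAG.finditer(tagged)]
```

With the server from above running, `parse_inline_xml(query_ner_server("Soros accuses Trump of wanting a 'mafia state'."))` should give you a list of (text, label) pairs instead of a string you have to pick apart by hand.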

python

nlp

stanford-nlp