Stanford TokensRegex: how to set normalized annotation using normalized output of NER annotation?

Question

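I am creating a TokensRegex annotator to extract the number of floors a building has (just an example to illustrate my question). I have a simple pattern that recognizes both "4 floors" and "four floors" as instances of my custom entity "FLOORS". I would also like to add a NormalizedNER annotation that uses the normalized value of the number entity in the expression, but I can't get it to work the way I want: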

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
normalized = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NormalizedNamedEntityTagAnnotation" }
tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }

ENV.defaults["ruleType"] = "tokens"

{
  pattern: ( ( [ { ner:NUMBER } ] ) /floor(s?)/ ),
  action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $$1.text) )
}

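The rule above only sets the NormalizedNER field in the output to the text value of the number, i.e. "4" and "four" respectively for the examples above. Is there a way to use the normalized value of the NUMBER entity ("4.0" for both "4" and "four") as the normalized value of my "FLOORS" entity?

Thanks in advance.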

Answer 1

Try changing

action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $$1.text) )

to

action: ( Annotate($1, ner, "FLOORS"), Annotate($1, normalized, $$1.normalized) )

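Annotate takes three arguments:

arg1 = the object to annotate (typically the matched tokens, indicated by $0)
arg2 = the annotation field
arg3 = the value (in this case, you want the NormalizedNER field rather than the text field)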

Answer 2

With $$1.normalized as you suggested, running on the input "The building has seven floors" yields the following error message: Annotating file test.txt { Error extracting annotation from seven floors }

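This is probably because the NamedEntityTagAnnotation key does not yet exist on the token that $$1 refers to. I suppose that, before running TokensRegex, you need to make sure that your number tokens ("four" or "4" in this case) have the corresponding normalized value ("4.0" in this case) set under their NamedEntityTagAnnotation key.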

Also, could you please direct me to where I can find more information on the possible 3rd arguments of Annotate()? Your Javadoc page for TokensRegex expressions doesn't list $$n.normalized (perhaps it needs updating?)

I believe what $$n.normalized does is retrieve the value that, in Java code, would be coreLabel.get(edu.stanford.nlp.ling.CoreAnnotations.NormalizedNamedEntityTagAnnotation.class), where coreLabel is of type CoreLabel and corresponds to $$n in TokensRegex. This is because your TokensRegex file contains the following line: normalized = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NormalizedNamedEntityTagAnnotation" }

Answer 3

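The correct answer follows from @AngelChang's answer and comments; I am posting it here only for the sake of tidiness.

The rule has to be modified so that the third argument of the second Annotate() action is $$1[0].normalized: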

{
  pattern: ( ( [ { ner:NUMBER } ] ) /floor(s?)/ ),
  action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $$1[0].normalized) )
}

As per @Angel's comment:

$$1[0].normalized is the "normalized" field of the 0th token of the 1st capture group (as a CoreLabel). The $$1 gives you back the MatchedGroupInfo, which has the "text" field but not the normalized field (since that is on the actual token).
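
The distinction in that comment can be illustrated with a small toy model. This is only a sketch in plain Python, not CoreNLP code: Token, GroupInfo and annotate are hypothetical stand-ins for CoreLabel, the matched-group info and Annotate(), and the assumption that Annotate() sets the value on every token of its target is mine. The value "4.0" for "four" is taken from the question.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Toy stand-ins only. None of these names exist in CoreNLP.


@dataclass
class Token:
    """Plays the role of a CoreLabel: annotations are stored on the token."""
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class GroupInfo:
    """Plays the role of what $$1 evaluates to: info about a matched group."""
    tokens: List[Token]

    @property
    def text(self) -> str:
        # The group can report its matched text (the role of $$1.text).
        return " ".join(t.annotations["text"] for t in self.tokens)
    # There is deliberately no "normalized" attribute at the group level.


def annotate(target: List[Token], key: str, value: str) -> None:
    """Mimics Annotate(target, key, value); assumed to set key on each token."""
    for token in target:
        token.annotations[key] = value


# "four floors" as it might look once an NER step has already run.
four = Token({"text": "four", "ner": "NUMBER", "normalized": "4.0"})
floors = Token({"text": "floors", "ner": "O"})

match = [four, floors]       # role of $0: the whole match
group1 = GroupInfo([four])   # role of $$1: the first capture group

annotate(match, "ner", "FLOORS")
# Role of $$1[0].normalized: step down to token 0 of the group, then read its field.
annotate(match, "normalized", group1.tokens[0].annotations["normalized"])
```

In this model, group1.text works but group1.normalized does not exist, which mirrors why $$1.text is accepted while $$1.normalized fails, and why the value has to be read from the token itself.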

stanford-nlp