How to create a new entity and use it to find entities in my test data? How to make my tokenization work?

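I want to create a new entity type, let's call it "medicine", and train it on my corpus. From there, I want to identify every "medicine" entity. Somehow my code is not working; can anyone help me?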

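import nltk


def extract_entity_names(t):
    """Recursively collect the text of every subtree labeled 'NE'."""
    entity_names = []

    # Only nltk.Tree nodes have label(); leaves are plain (word, tag) tuples.
    if hasattr(t, 'label'):
        if t.label() == 'NE':
            entity_names.append(' '.join(child[0] for child in t))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names


test = input("Please enter your file name")
test1 = input("Please enter your second file name")

# First file: lines of text that contain the placeholder 'value'.
with open(test, "r") as file:
    new = file.read().splitlines()

# Second file: one replacement string per line.
with open(test1, "r") as file2:
    new1 = file2.read().splitlines()

for s in new:
    for x in new1:
        # Substitute the placeholder with each replacement string.
        sample = s.replace('value', x)
        print(sample)

        # Standard NLTK pipeline: sentences -> tokens -> POS tags -> NE chunks.
        sentences = nltk.sent_tokenize(sample)
        tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
        tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
        chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

        print(sentences)

        # Print whatever the pre-trained chunker labeled as a named entity.
        for tree in chunked_sentences:
            print(extract_entity_names(tree))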

How to create a new entity and use it to find entities in my test data?

Named entity recognizers are probabilistic, neural, or linear models. In your code,

chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

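makes this prediction, using NLTK's pre-trained chunker. That model only knows the entity types it was trained on (with binary=True it emits a single generic NE label), so if you want it to recognize a new entity type, you first have to train a classifier on annotated data that contains the new entity type.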
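To make "annotated data" concrete, here is a minimal sketch of one common way to represent it: one IOB tag per token. The MEDICINE label, the example sentences, and the helper name extract_spans are all hypothetical, chosen only for illustration. The sketch uses the standard library alone and does not train anything; it only shows what the training input could look like and how entity mentions are read back out of the tags.

```python
# Hypothetical annotated data in IOB format:
# B-MEDICINE starts a medicine mention, I-MEDICINE continues it, O is outside.
TRAIN_SENTS = [
    [("Take", "O"), ("ibuprofen", "B-MEDICINE"), ("twice", "O"), ("daily", "O")],
    [("She", "O"), ("was", "O"), ("given", "O"),
     ("amoxicillin", "B-MEDICINE"), ("clavulanate", "I-MEDICINE"), (".", "O")],
]


def extract_spans(tagged_tokens, label="MEDICINE"):
    """Collect the text of every mention of `label` from IOB-tagged tokens."""
    spans, current = [], []
    for token, tag in tagged_tokens:
        if tag == f"B-{label}":
            # A new mention begins; close any mention still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == f"I-{label}" and current:
            # Continue the mention that is currently open.
            current.append(token)
        else:
            # Outside any mention of this label; close the open one, if any.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


for sent in TRAIN_SENTS:
    print(extract_spans(sent))
```

A trained model would be expected to produce tags of this shape for unseen sentences, and the same helper could then pull out the predicted mentions.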

Somehow my code is not working,

As said above, NLTK's model was never trained on your own data, so it cannot recognize "medicine" entities.

How to make my tokenization work?

The tokenizer only extracts word tokens, which in your code is done by this line:

tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

However, the tokenizer does not predict named entities itself.
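As a rough illustration of that point, the sketch below uses a simple regular-expression tokenizer as a stand-in for nltk.word_tokenize (the function name simple_tokenize and the sample sentence are hypothetical, and a real tokenizer handles many more cases). The output is just a list of strings, with no information about which token might be a medicine.

```python
import re


def simple_tokenize(text):
    """Split text into word and punctuation tokens (a rough stand-in tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Take ibuprofen twice daily.")
print(tokens)  # ['Take', 'ibuprofen', 'twice', 'daily', '.']
```

Labeling "ibuprofen" as an entity is a separate step that happens after tokenization.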

If you want to use NLTK to train a model that predicts custom named entities (such as medicines), try this tutorial.

In my personal experience, NLTK may not be well suited to this; take a look at spaCy.