How do I create a new entity and use it to find entities in my test data? How do I get my tokenization to work?

Question

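I want to create a new entity type, let's call it "medicine", and train it on my corpus. After that, I want to identify every "medicine" entity in my test data. Somehow my code is not working. Can anyone help?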

import nltk


def extract_entity_names(t):
    """Recursively collect the text of every subtree labeled 'NE'."""
    entity_names = []

    # Leaves are (word, tag) tuples and have no label() method
    if hasattr(t, 'label'):
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names


test = input("Please enter your file name")
test1 = input("Please enter your second file name")

with open(test, "r") as file:
    new = file.read().splitlines()

with open(test1, "r") as file2:
    new1 = file2.read().splitlines()

for s in new:
    for x in new1:
        # Substitute each line of the second file for the placeholder 'value'
        sample1 = s.replace('value', x)
        print(sample1)

        sentences = nltk.sent_tokenize(sample1)
        tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
        tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
        chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

        print(sentences)

        # With binary=True the chunker only marks generic 'NE' chunks
        for tree in chunked_sentences:
            print(extract_entity_names(tree))

Answer 1

How do I create a new entity and use it to find entities in my test data?

A named entity recognizer is a probabilistic, neural, or linear model. In your code, the line

chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

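makes this prediction. So if you want it to recognize a new entity type, you first have to train a classifier on annotated data that contains the new entity type.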
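To make "annotated data" concrete, here is a minimal sketch of one common way to label tokens for a new entity type, using BIO tags (B = begins an entity, I = inside one, O = outside). The helper `to_bio`, the `MEDICINE` label, and the example sentence are all hypothetical and only illustrate the format; your own annotation scheme may differ.

```python
def to_bio(tokens, spans):
    """Label tokens with BIO tags.

    spans holds (start, end, label) token indices, with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return list(zip(tokens, tags))


# Hypothetical annotated sentence: tokens 1 and 2 form one MEDICINE entity.
tokens = ["Take", "acetylsalicylic", "acid", "daily"]
annotated = to_bio(tokens, [(1, 3, "MEDICINE")])

for token, tag in annotated:
    print(token, tag)
```

A training set is then a list of such tagged sentences, which a sequence model can learn from.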

Somehow my code is not working,

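As noted above, you have not trained NLTK's model on your own data, so it cannot work: the pretrained chunker has no "medicine" label, and with binary=True it only marks generic NE chunks.

How do I get my tokenization to work?

The tokenizer only extracts word tokens, which your code already does with this line:

tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

However, the tokenizer does not predict named entities itself.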

If you want to train a model with NLTK to predict custom named entities (such as medicines), try this tutorial.
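As a rough idea of what such training can look like, here is a minimal sketch of a tagger-based chunker in NLTK. It is not the tutorial's method: it wraps a `UnigramTagger` that simply memorizes one IOB tag per word, so it only recognizes medicines it has seen in training. The class name `MedicineChunker`, the `MEDICINE` label, and the tiny training set are hypothetical; a usable model would need far more annotated data and richer features.

```python
from nltk.chunk import ChunkParserI, conlltags2tree
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical hand-annotated sentences: (word, POS tag, IOB entity tag).
TRAIN_SENTS = [
    [("Take", "VB", "O"), ("aspirin", "NN", "B-MEDICINE"), ("daily", "RB", "O")],
    [("She", "PRP", "O"), ("needs", "VBZ", "O"), ("ibuprofen", "NN", "B-MEDICINE")],
]


class MedicineChunker(ChunkParserI):
    """Chunker that learns one IOB tag per word from annotated sentences."""

    def __init__(self, train_sents):
        word_iob = [[(word, iob) for word, _pos, iob in sent] for sent in train_sents]
        # Words never seen in training fall back to "O" (outside any entity).
        self.tagger = UnigramTagger(word_iob, backoff=DefaultTagger("O"))

    def parse(self, tagged_sent):
        words = [word for word, _pos in tagged_sent]
        iob_tags = [tag for _word, tag in self.tagger.tag(words)]
        triples = [(w, pos, iob) for (w, pos), iob in zip(tagged_sent, iob_tags)]
        return conlltags2tree(triples)


chunker = MedicineChunker(TRAIN_SENTS)
# Input is a POS-tagged sentence, e.g. the output of nltk.pos_tag.
tree = chunker.parse([("I", "PRP"), ("took", "VBD"), ("aspirin", "NN")])
medicines = [
    " ".join(word for word, _pos in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "MEDICINE"
]
print(medicines)
```

The chunker takes the place of `nltk.ne_chunk_sents` in the pipeline: tokenize, POS-tag, then call `chunker.parse` on each tagged sentence.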

In my personal experience, NLTK may not be the best fit for this; take a look at spaCy.
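For a quick start in spaCy, here is a minimal sketch that assumes spaCy v3 or later. Note that it uses the rule-based `entity_ruler` component with a hand-written list of medicines rather than a trained statistical model, so it is a baseline and not the training approach described above. The `MEDICINE` label and the patterns are hypothetical.

```python
import spacy

# A blank English pipeline: tokenizer only, no model download needed.
nlp = spacy.blank("en")

# Rule-based entity matcher; the patterns below are hypothetical examples.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "MEDICINE", "pattern": "aspirin"},                  # exact phrase
    {"label": "MEDICINE", "pattern": [{"LOWER": "ibuprofen"}]},   # case-insensitive token
])

doc = nlp("Take aspirin or Ibuprofen after meals.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

The matched spans can also serve as a starting point for annotated data if you later train spaCy's statistical entity recognizer.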

python

entity

nlp