How to extract the information I want with NLTK

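I want to extract relevant information on a few topics. For example:

As a first step, I extracted text from one of the websites. For example: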

i think AIA does a more better life insurance as my comparison and the companies comparisonand most important is also medical insurance in my opinionyes there are some agents that will sell u plans that their commission is high...dun worry u buy insurance from a company anything happens u can contact back the company also can ...better find a agent that is reliable and not just working for the commission for now , they might not service u in the future...thanksregardsdiana ""

Then, using NLTK in VS2015, I tried tokenizing it:

import nltk
toks = nltk.word_tokenize(text)

Using pos_tag, I can tag my toks:

postoks = nltk.tag.pos_tag(toks)

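From this point on I am not sure what I should do. Previously I used IBM Text Analytics. In that software I would create dictionaries, then create some patterns, and then analyze the data. For example: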

Sample of Dictionary: insurance_cmp : {AIA, IMG, SABB}

Sample of pattern:

insurance_cmp + Good_Feeling_Pattern

insurance_cmp + ['purchase|Buy'] + Bad_Feeling_Pattern

Good_Feeling_Pattern = [good, like it, nice]

Bad_Feeling_Pattern = [bad, worse, not good, regret]

I would like to know whether I can simulate the same thing in NLTK. Can a chunker and a grammar help me extract what I am looking for? Do you have any ideas for improving this?

grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""
chunker = nltk.RegexpParser(grammar)

tree = chunker.parse(postoks)
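
For reference, here is a minimal sketch of how the chunks in such a tree could be listed. It assumes a short hand-tagged token list in place of the real `postoks` (the tags are written by hand for illustration, not produced by `pos_tag`), and `noun_phrases` is a hypothetical helper name, not part of NLTK.

```python
import nltk

# Hand-tagged tokens standing in for the output of nltk.tag.pos_tag.
# These tags are an assumption made for illustration only.
postoks = [
    ("AIA", "NNP"), ("does", "VBZ"), ("a", "DT"), ("better", "JJR"),
    ("life", "NN"), ("insurance", "NN"), ("and", "CC"),
    ("medical", "JJ"), ("insurance", "NN"),
]

# Same grammar as in the question.
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(postoks)


def noun_phrases(tree):
    """Yield the text of every NP chunk found in the parsed tree."""
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        # leaves() returns the (word, tag) pairs covered by this chunk.
        yield " ".join(word for word, tag in subtree.leaves())


for phrase in noun_phrases(tree):
    print(phrase)
```

The extracted phrases could then be compared against a dictionary such as insurance_cmp to keep only the ones of interest.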

What should my next step be to reach my goal? Please help.

You just need to follow these videos

or read this blog.
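
As a rough sketch, the dictionary-plus-pattern approach from the question could be approximated in NLTK by tagging tokens with dictionary names instead of part-of-speech tags and then chunking over those tags with `nltk.RegexpParser`. Everything below is an assumption for illustration: the tag names (`CMP`, `BUY`, `GOOD`, `BAD`, `O`), the helper functions, the gap of at most four filler tokens, and the use of `wordpunct_tokenize` (chosen because it needs no downloaded models; `word_tokenize` would work as well). Multi-word entries such as "like it" or "not good" are left out, since this sketch only looks up single tokens.

```python
import nltk
from nltk.tokenize import wordpunct_tokenize

# Hypothetical dictionaries modelled on the ones in the question
# (single-word entries only).
DICTIONARIES = {
    "CMP": {"aia", "img", "sabb"},
    "BUY": {"purchase", "buy"},
    "GOOD": {"good", "nice"},
    "BAD": {"bad", "worse", "regret"},
}


def dictionary_tag(tokens):
    """Tag each token with the name of its dictionary, or 'O' if it is in none."""
    tagged = []
    for tok in tokens:
        label = next(
            (name for name, words in DICTIONARIES.items() if tok.lower() in words),
            "O",
        )
        tagged.append((tok, label))
    return tagged


# Patterns over dictionary tags, allowing up to four filler tokens in each gap.
grammar = r"""
    GOOD_OPINION:
        {<CMP><O>{0,4}<GOOD>}
    BAD_OPINION:
        {<CMP><O>{0,4}<BUY><O>{0,4}<BAD>}
"""
chunker = nltk.RegexpParser(grammar)


def extract_opinions(text):
    """Return (pattern_name, matched_text) pairs found in the text."""
    tree = chunker.parse(dictionary_tag(wordpunct_tokenize(text)))
    wanted = {"GOOD_OPINION", "BAD_OPINION"}
    results = []
    for subtree in tree.subtrees(filter=lambda t: t.label() in wanted):
        words = [word for word, tag in subtree.leaves()]
        results.append((subtree.label(), " ".join(words)))
    return results


print(extract_opinions("I think AIA is really good"))
print(extract_opinions("With IMG you buy a plan and regret it"))
```

Informal text like the sample post ("dun", "u", missing spaces) would probably need some normalization before patterns like these match reliably.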